This thesis studies the minimum area $A = a_{n,k}(T)$ required by a layout of a VLSI circuit that
sorts $n$ $k$-bit keys in time $T$.

The  square  tessellation  technique  is  introduced  as  a  powerful  tool  to  establish  area-time  lower 
bounds,  based  on  the  information  exchanged  across  the  boundary  of  a  suitable  set  of  square  cells  that 
tessellate  the  layout  region.  When  the  information  exchange  is  due  to  the  fact  that  variables  output  on 
one  side  of  the  cell  boundary  are  functions  of  variables  input  on  the  other  side,  the  square  tessellation 
yields bounds on the $AT^2$ measure. When, on the other hand, the information exchange is due to the
fact that the cell saturates its storage resources and sends some information outside for temporary
storage, the square tessellation yields bounds on the $AT$ measure. Both $AT^2$ and $AT$ lower bounds are
obtained  for  sorting.  The  former  dominate  in  fast  computations,  while  the  latter  dominate  in  slow 
computations. 

The analysis indicates that the nature of the problem varies considerably with the relative size of
$n$ and $k$, and suggests a classification of keys into short ($k \le \log n$), long ($k \ge 2\log n$), and
medium-length.

Optimal or near-optimal designs of VLSI sorters are proposed for the entire range of $n$, $k$, and $T$,
confirming  the  inherent  validity  of  the  lower-bound  analysis. 
CHAPTER 1


INTRODUCTION 


1.1  VLSI  COMPUTATION 

The breakthroughs in the field of electronic devices, which have led to Very-Large-Scale-Integration
(VLSI) technology, open new avenues to the system designer in almost all areas of electrical
engineering [MC79, Mu82]. New system-theoretic concepts are necessary to take full advantage of
the new technological potential, although existing theories will be an invaluable starting point, according
to a pattern typical in scientific research.

We  shall  focus  on  computing  systems,  which  will  be  among  the  first  to  be  affected  by  the  VLSI 
technological  revolution.  However,  considering  that  communication  and  control  systems  make  an 
increasing use of digital techniques for signal processing (a particular kind of computation), we realize
that  computing  is  fundamental  for  all  electrical  engineering. 

The main feature that makes VLSI a very attractive environment for computing systems is the
possibility of deploying, at a reasonable cost, a large number of processors cooperating in the execution of
a given task. This possibility has long been pursued in the hope of increasing the system's computational
throughput by means of concurrency of operations.

A systematic development of the notion of concurrency implies a radical departure from the architecture
of the traditional Von Neumann computer, and from the sequential nature of the corresponding
algorithms, given as sequences of very elementary instructions, each of which is to be executed in succession
by the same processor. The departure from the uni-processor architecture poses the fundamental
question of how to interconnect many processors so that they can efficiently exchange information
when cooperating in solving a given problem. The interconnection network is in fact the most relevant
feature of a parallel architecture and strongly constrains its computational capabilities. A formal way
to view a parallel architecture consists of associating with it a graph whose vertices correspond to processors
and whose arcs correspond to data paths. The attempt to define a general-purpose architecture,
whose interconnection network can support any processor-to-processor data transfer that an algorithm
may require, leads one to consider the fully interconnected graph. This architecture, under an equivalent
formulation known as the Shared Memory Machine, has in fact been extensively used in the first theoretical
studies of parallel computing. However, practical considerations on limited fan-in and fan-out factors,
and also on cost-effectiveness, show that the fully interconnected architecture is not a realistic
model for a computer, and motivate the investigation of architectures with simpler interconnections
that still efficiently support the execution of parallel algorithms. Thus we are led to the following
situation. Each computational problem calls for the joint design of an algorithm and an architecture.
The ‘best’ architecture may change with the problem. Only a posteriori, after careful analysis of many
problems, may we find out whether there are general-purpose, or at least broad-purpose, architectures
that are efficient for a large class of problems.

In this context an appraisal of a design must be based not only on algorithmic performance, typically
characterized by time complexity, but also on some other measure capturing the “architectural
complexity”. The traditional count of processors is not an adequate measure because it totally disregards
the communication aspects of the system. Other mathematically reasonable candidates could be
related  to  the  number  of  edges,  or  to  the  maximum  degree,  or  to  the  diameter  of  the  interconnection 
graph.  However,  none  of  these  measures  seems  to  reflect  completely  the  cost  of  actually  building  the 
architecture  in  any  technology  of  current  interest. 
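
The candidate measures just listed are easy to compute for small interconnection graphs. The following is a minimal sketch, not part of the original analysis: the helper names `graph_measures`, `complete_graph`, and `ring` are hypothetical, and the graph is assumed to be connected and undirected.

```python
from collections import deque


def graph_measures(adj):
    """Return (edge count, maximum degree, diameter) of a connected undirected graph.

    `adj` maps each vertex to the set of its neighbours (illustrative representation).
    """
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    max_degree = max(len(nbrs) for nbrs in adj.values())

    def eccentricity(src):
        # Breadth-first search gives the distance from src to every other vertex.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())

    diameter = max(eccentricity(v) for v in adj)
    return edges, max_degree, diameter


def complete_graph(p):
    """Fully interconnected graph on p processors."""
    return {u: {v for v in range(p) if v != u} for u in range(p)}


def ring(p):
    """Ring of p processors (p >= 3)."""
    return {u: {(u - 1) % p, (u + 1) % p} for u in range(p)}


print(graph_measures(complete_graph(8)))  # (28, 7, 1)
print(graph_measures(ring(8)))            # (8, 2, 4)
```

The two examples rank in opposite order under different measures (the complete graph wins on diameter, the ring on edges and degree), which illustrates why no single one of these quantities is a satisfactory cost measure.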

It is therefore of the greatest theoretical interest that VLSI technology naturally offers an
attractive measure of architectural complexity: the chip area. Due to the integrated nature of VLSI
technology,  where  processing  elements  (transistors)  and  communication  elements  (wires)  are  realized  in 
the  same  medium  (the  silicon  chip),  chip  area  effectively  accounts  for  the  cost  of  all  relevant  aspects  of 
the system, and its minimization is a major concern in industrial applications.

The fact that the architecture to solve a problem is not given in advance, as it was in traditional
algorithm design for the Von Neumann machine, brings an interesting new consequence: the architectural
complexity can be traded for the efficiency of the computation. In the context of VLSI computation,
this phenomenon takes the form of an area-time trade-off, and plays a central role in the theory.

The  rigorous  development  of  a  computation  theory  for  VLSI  rests  on  the  definition  of  a  model  of 
computation  that  captures  the  essential  traits  of  the  technology  and  allows  for  mathematical  treatment 
of system design. A VLSI model of computation is available today, as the result of the effort of several
authors [T80, BK81, Se73, Sa79, CM81, BPP82, BL84], and will be described in the next section. The
assumptions  of  the  model  will  be  stated  axiomatically.  For  their  justification,  which  is  sometimes  based 
on  a  rather  delicate  and  subtle  analysis,  we  refer  the  reader  to  the  literature  cited  above. 

1.1.1  The  VLSI  Model  of  Computation 

A  VLSI  chip  can  be  viewed  as  a  computation  graph  whose  vertices  are  information  processing 
devices  and  whose  arcs  are  wires,  that  is,  electrical  connections  responsible  for  information  transfer  as 
well  as  for  power  supply  and  distribution  of  timing  waveforms.  A  given  computation  graph  is  to  be 
laid  out  in  conformity  with  the  rules  dictated  by  technology.  The  essence  of  these  rules  is  formally 
accounted  for  in  the  model  as  follows: 

Area Assumptions

(A1) (Wire Area): All wires have minimum width $\lambda > 0$ (which includes both the actual wire width
and the clearance between the wire and any other chip region), and at most $\nu$ wires ($\nu$ an integer
$\ge 2$) can overlap at any point (hypothesis of bounded number of layers).

(A2) (Transistor-Port Area): Transistors and I/O ports have minimum areas $c_T \lambda^2$ and $c_P \lambda^2$,
respectively, for constants $c_T$ and $c_P$.

(A3) (Chip Area): The chip area is at least the sum of the areas of the wires, of the transistors, and of
the I/O ports, and it is at most the area of the smallest rectangle (or convex region) enclosing a
legal layout of the graph.


The area assumptions allow a straightforward appraisal of the area of any given design. To appraise
the computation time of an algorithm we need some assumptions on the timing of elementary actions, such as
gate switching and signal transmission on wires. For simplicity, in the sequel switching time is
subsumed under propagation time.

Time  Assumptions 

(T1) (Propagation Time Along a Wire): A bit requires a constant time $\tau$ to propagate along a wire,
irrespective of its length (synchronous model).

(T2) (Algorithm Time): The computation time of an algorithm is the time of the longest sequence of
wire propagation times between beginning and completion of the computation.

Assumption (T1) is not immediate to justify, and it is in fact false at the physical level. The essence of
the justification is that, although a detailed analysis of the electrical phenomenon of wire propagation
[BPP82] shows that constant transmission time can be achieved on long wires only if proportionately
large driving transistors are deployed, results of layout theory [BL84] ensure that a layout of the computation
graph can always be found in which large drivers can be accommodated without substantial
degradation of area and time performance.
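
To make assumptions (T1) and (T2) concrete, the following minimal sketch evaluates the computation time of a small feedback-free computation graph as the longest chain of wire traversals, each costing one unit $\tau$. Restricting to acyclic graphs is a simplifying assumption of this illustration, and the names `computation_time` and `example_graph` are hypothetical.

```python
from functools import lru_cache


def computation_time(succ, tau=1.0):
    """Longest chain of wire traversals in an acyclic computation graph, times tau.

    `succ` maps every vertex to the list of vertices its output wires lead to.
    Each wire costs tau regardless of its length, as in assumption (T1).
    """

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Number of wires on the longest path starting at u.
        return max((1 + longest_from(v) for v in succ.get(u, ())), default=0)

    return tau * max(longest_from(u) for u in succ)


# Illustrative graph: two inputs, two gates in series, one output.
example_graph = {
    "x0": ["g1"],
    "x1": ["g1", "g2"],
    "g1": ["g2"],
    "g2": ["y"],
    "y": [],
}

print(computation_time(example_graph))           # 3.0
print(computation_time(example_graph, tau=2.0))  # 6.0
```

The longest chain here is x0, g1, g2, y, which traverses three wires.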

The transfer of information within the chip is constrained not only by wire bandwidth, but also
by the fan-in and fan-out capabilities of logic gates. This fact is accounted for by the following
assumptions.

Fan-in  and  Fan-out  Assumptions 

(F1) (Bounded Fan-in): The number of input lines of a logic gate is upper bounded by a constant $f_i$.

(F2) (Bounded Fan-out): The number of output lines of a logic gate is upper bounded by a constant $f_o$.

Other assumptions are often stated in the VLSI computation literature when studying lower and
upper bounds for specific problems. These assumptions are not dictated by technological constraints but
rather by reasons of various kinds: for instance, to avoid trivial or meaningless solutions, to enforce
features that are appealing for practical application, or to simplify the analysis. Most of these auxiliary
assumptions concern the I/O protocol. We list here the most common ones:

Protocol  Assumptions 

(P1) (Semellective Protocol): The input data of the problem are available only once at the input ports.

(P2) (Time-Determinate Protocol): Input and output data are available at prespecified
(instance-independent) times.

(P3) (Place-Determinate Protocol): Input and output data are available at prespecified
(instance-independent) ports.

(P4) (Boundary Protocol): All I/O ports are on the boundary of the layout region.

(P5) (Word-Local Protocol): All the bits of a given input word enter the chip at the same input port.

Unless explicitly stated otherwise, the assumptions on area, time, fan-in and fan-out, as well as assumptions
P1, P2, and P3 on I/O protocols, will hold throughout this thesis. P4 and P5, instead, will always be
explicitly mentioned when adopted.

It  is  worth  observing  that,  although  all  our  networks  will  exhibit  bounded  fan-in  and  bounded 
fan-out, assumptions F1 and F2 will not be needed in most of our lower-bound proofs.

Usually, in asymptotic analysis, the specific values of some of the constants in the
model, such as $c_T$, $c_P$, $\lambda$, and $\tau$, are not relevant, and can all be conventionally chosen equal to one.

It  is  also  convenient,  when  considering  layouts  of  computation  graphs,  to  restrict  the  attention  to 
embeddings  on  a  suitable  rectangular  grid.  Generally,  this  restriction  could  be  easily  removed  at  the 
price of more elaborate proofs, which would not add particular insight to the analysis.

1.1.2  The  VLSI  Complexity  of  a  Computational  Problem 

Once a model of computation for VLSI is defined, algorithms for various problems can be proposed
and analyzed, and a coherent theory can be developed. Several authors have proposed performance
measures, typically a function of the area $A$ and of the time $T$ of the form $AT^{\alpha}$, with respect to which
optimality can be defined. In our opinion, however, the following approach is more fundamental.

Given a computation problem $\Pi$, to any chip that solves $\Pi$ for an input of size $n$ in time $T_0$ and
area $A_0$ we can associate a point of coordinates $(T_0, A_0)$ in a plane, which we call the time-area plane. The
set of all designs then corresponds to a region in this plane. The objective of VLSI complexity theory is
the determination of this region. Since, if a point $(T_0, A_0)$ is feasible, then $(T_0, A)$ is feasible for
any $A > A_0$ (just waste some area!), the objective can be reformulated as follows.

Given a computation problem $\Pi$, its VLSI complexity is described by the family of curves
$A = a_n(T)$, one for each value $n$ of the problem size, where $a_n(T) = \min\{A_0 :$ there is a chip that
solves $\Pi$ on instances of size $n$ with performance $(T, A_0)\}$.

Usually there is a minimum value $T_{\min}(n)$ of the computation time below which no feasible
design exists, and a maximum value $T_{\max}(n)$ above which $a_n(T)$ is constant, meaning that no savings
in area result from slowing down the computation. In conclusion we would like to find, for a given
problem, the value of $a_n(T)$ for $T \in [T_{\min}(n), T_{\max}(n)]$. Typically $a_n(T)$ is determined within a constant
factor by establishing suitable lower and upper bounds. As expected, $a_n(T)$ is increasing in $n$ and
decreasing in $T$, expressing the fact that a faster computation requires more computing resources.
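
The definition of the time-area curve can be illustrated on a finite catalogue of designs. The sketch below is only an illustration under stated assumptions: the design points are invented, the function names are hypothetical, and we assume that a chip finishing in time $T_0$ also counts as a design for any slower time $T \ge T_0$.

```python
def area_time_curve(designs):
    """Return the staircase [(T, a(T)), ...] for a list of (time, area) design points.

    Assumption of this sketch: a design that finishes in time T0 is also feasible
    for every T >= T0, so only designs that strictly reduce the area are kept.
    """
    curve = []
    best = float("inf")
    for t, a in sorted(designs):
        if a < best:
            best = a
            curve.append((t, a))
    return curve


def min_area(curve, t):
    """Smallest area among designs usable within time t, or None below T_min."""
    feasible = [a for (t0, a) in curve if t0 <= t]
    return min(feasible) if feasible else None


# Hypothetical (time, area) pairs for one fixed problem size n.
designs = [(4, 100), (8, 30), (8, 50), (16, 40), (32, 10)]
curve = area_time_curve(designs)

print(curve)                 # [(4, 100), (8, 30), (32, 10)]
print(min_area(curve, 2))    # None: below the minimum time
print(min_area(curve, 10))   # 30
print(min_area(curve, 100))  # 10: no further savings beyond the maximum time
```

In this toy catalogue the minimum time is 4 and the curve is constant from time 32 onwards, mirroring the roles of $T_{\min}(n)$ and $T_{\max}(n)$.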


1.2  PROBLEM  STATEMENT  AND  ORGANIZATION  OF  THE  THESIS 
1.2.1  Sorting 

Sorting  is  a  fundamental  combinatorial  operation,  and  is  among  the  most  frequently  performed 
by computing systems. Thus, the VLSI complexity of sorting has received a lot of attention from
researchers. But, in spite of intensive study, this problem does not cease to offer extremely intriguing
questions, and to reveal heretofore unsuspected facets.

Formally, the $(n,k)$-sorting problem is defined as follows:

(1) The input is a sequence of $n$ $k$-bit keys, each a member of a finite set of integers.

(2) The output is a rearrangement of the input keys, so that they form a nondecreasing sequence.

Throughout this thesis, we represent the input of the $(n,k)$-sorting problem as an $n \times k$ array of
binary variables

$X = \{X_i^j : i = 0, 1, \ldots, n-1;\ j = 0, 1, \ldots, k-1\}$

where $X_i^j$ is the coefficient of $2^j$ in the binary representation of the $i$-th input key. The $i$-th row of
$X$, denoted by $X_i$, represents the $i$-th input key, and the $j$-th column of $X$, denoted by $X^j$, represents
the $j$-th least significant position. A similar notation is adopted for the output array $Y$.
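
As a small illustration of this notation, the sketch below builds the array $X$ from a list of integer keys and produces the corresponding sorted output array $Y$. The function names are hypothetical, and the array is stored as a list of rows indexed as `X[i][j]`.

```python
def keys_to_bit_array(keys, k):
    """X[i][j] is the coefficient of 2**j in the binary representation of the i-th key."""
    assert all(0 <= key < 2 ** k for key in keys)
    return [[(key >> j) & 1 for j in range(k)] for key in keys]


def bit_array_to_keys(X):
    """Inverse mapping: rebuild each key from its row of bits."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in X]


def sort_bit_array(X):
    """Output array Y: the rows of X rearranged into nondecreasing key order."""
    k = len(X[0])  # assumes at least one key
    return keys_to_bit_array(sorted(bit_array_to_keys(X)), k)


X = keys_to_bit_array([5, 1, 6], k=3)
Y = sort_bit_array(X)

print(X)  # [[1, 0, 1], [1, 0, 0], [0, 1, 1]]
print(Y)  # [[1, 0, 0], [1, 0, 1], [0, 1, 1]]
print([row[0] for row in X])  # column j = 0 (least significant bits): [1, 1, 0]
```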

One  could  be  tempted  to  analyze  the  complexity  of  sorting  as  a  function  of  nk,  the  total  number 
of  input  bits.  However,  as  will  be  fully  substantiated  in  the  following  chapters,  the  nature  of  sorting 
is  strongly  influenced  by  the  relative  size  of  n  and  k.  Thus,  it  is  appropriate  to  state  the  objective  of  our 
study as the determination of the minimum area $A = a_{n,k}(T)$ sufficient to lay out a circuit that solves
the $(n,k)$-sorting problem, as a function of $n$ and $k$.

1.2.2.  Thesis  Outline 

This  thesis  is  organized  in  two  parts,  respectively  devoted  to  the  study  of  lower  and  upper  bounds 
to the area-time complexity of sorting.

In Chapter 2, after a review of known area-time lower-bound techniques, we study the subject of
multiset encoding, which turns out to be deeply related to sorting. In fact, although the input to the
$(n,k)$-sorting problem is given as a sequence of keys, the output depends exclusively on the multiset
underlying the input sequence. In the VLSI environment, where computation is governed by the flow
of  information  in  the  two-dimensional  chip,  the  information-theoretic  content  of  the  input  multiset  has 
a  fundamental  influence  on  the  area-time  complexity  of  sorting.  The  fact  that  this  information  content 
is  very  sensitive  to  the  relative  sizes  of  n  and  k  is  the  primary  reason  for  which  the  nature  of  sorting  is 
strongly  dependent  on  the  length  of  the  keys. 
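
The sensitivity of the multiset information content to the relative sizes of $n$ and $k$ can be checked numerically. The sketch below is an illustration, not the encoding analysis of Chapter 2: it uses the standard count $\binom{2^k + n - 1}{n}$ of multisets of size $n$ over a universe of $2^k$ values, and the function names are hypothetical.

```python
from math import comb, log2


def sequence_bits(n, k):
    """Bits needed to write down the input as a sequence of n k-bit keys."""
    return n * k


def multiset_bits(n, k):
    """log2 of the number of multisets of n keys drawn from a universe of 2**k values."""
    return log2(comb(2 ** k + n - 1, n))


n = 1024
for k in (2, 10, 20, 40):
    print(k, sequence_bits(n, k), round(multiset_bits(n, k)))
```

For keys much shorter than $\log n$ the multiset carries far fewer than $nk$ bits, while for long keys the two quantities differ only by roughly $\log_2 n!$ bits, which is the information lost by forgetting the order of the sequence.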

The traditional bisection flow technique is not adequate to study the area-time complexity of sorting,
except for a special interval of key lengths. In Chapter 3 we introduce the notion of square
tessellation, a partition of the layout region into square cells of identical size, and we show how to
obtain area-time lower bounds in terms of the information exchanged across the boundary of the tessellation
cells. A novel feature of these bounds is that their form depends upon the nature of the mechanism
forcing the information exchange. When the information exchange is due to the fact that the variables
output on one side of the cell boundary are functions of variables input on the other side, the
square tessellation technique yields lower bounds on the $AT^2$ measure. This mechanism has been extensively
studied in the literature, especially in connection with the bisection technique. In addition to it,
we  consider  here  for  the  first  time  another  mechanism,  which  we  call  saturation,  occurring  when  a  cell 
of  the  tessellation  fills  all  its  storage  in  the  course  of  the  computation,  and  sends  some  information  to 
the  rest  of  the  chip  for  the  only  purpose  of  temporary  storage,  to  request  it  back  at  a  later  time.  When 
the  information  exchange  is  due  to  saturation,  the  square  tessellation  technique  yields  bounds  on  the 
AT  measure. 

The effectiveness of the general techniques developed in Chapter 3 is demonstrated in Chapter 4,
where several lower bounds are obtained for two problems: cyclic shift and sorting. Here the keys are
classified into short ($k \le \log n$), long ($k \ge 2\log n$), and medium-length. Medium-length keys have
heretofore been the object of investigation, and can be adequately studied by bisection techniques. It is for
short and long keys that the full power of the square tessellation technique becomes evident. For both
cases, $AT^2$ and $AT$ lower bounds can be established, and it is interesting to observe that the $AT^2$ bound
dominates in fast computations, while the $AT$ bound dominates in slow computations. In the last section
of Chapter 4 we obtain bounds for the problem of comparison exchange, a special case of sorting where
the keys are just two. The bound is on the $AT/\log A$ measure, and rests crucially on the bounded fan-in
assumption, unlike the bounds mentioned above, which hold even for circuits with unbounded fan-in and
fan-out.
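
Why one bound dominates in fast computations and the other in slow ones follows from elementary algebra, sketched below. The quantities `q` and `l` are hypothetical placeholders for the right-hand sides of an $AT^2$ bound and an $AT$ bound at fixed $n$ and $k$; they are not the actual expressions derived in Chapter 4.

```python
def area_lower_bound(T, q, l):
    """Combine A*T**2 >= q and A*T >= l into a single lower bound on the area A."""
    return max(q / T ** 2, l / T)


def dominant_bound(T, q, l):
    """Name of the bound that gives the larger area requirement at time T."""
    return "AT^2" if q / T ** 2 >= l / T else "AT"


def crossover_time(q, l):
    """Time at which the two bounds coincide: q / T**2 == l / T, i.e. T == q / l."""
    return q / l


q, l = 1024.0, 64.0  # illustrative values only
print(crossover_time(q, l))                                  # 16.0
print(dominant_bound(4, q, l), area_lower_bound(4, q, l))    # AT^2 64.0
print(dominant_bound(64, q, l), area_lower_bound(64, q, l))  # AT 1.0
```

Since $q/T^2$ decays faster than $l/T$, the $AT^2$ bound is the stronger one for $T$ below the crossover and the $AT$ bound takes over above it.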


In Chapter 5 we turn our attention to upper bounds, and review some well-known parallel algorithms
for sorting, as well as some networks of processors particularly suited to VLSI implementations.


In Chapter 6 we study $(n,k)$-sorting for $k = \log n + \Theta(\log n)$. After explaining why this particular
value of key length plays a central role in the construction of sorting circuits, we turn our attention to
specific designs. We first consider the bitonic sorting algorithm, and propose two architectures, the
pleated cube-connected-cycles and the mesh of cube-connected-cycles, both of which achieve optimal
area-time performance in a wide spectrum of computation times. The fastest bitonic sorter works in
time $T = \Theta(\log^2 n)$. To obtain faster sorters we then turn our attention to another algorithm, the
merge-enumeration combination. A network that combines the cube-connected-cycles and the
orthogonal-trees architectures executes this algorithm in $\Theta(\log n)$ time and optimal area.

In Chapter 7 we consider the $(n,k)$-sorting problem for arbitrary $k$, and we propose three sorting
networks, respectively tailored to short, medium-length, and long keys. The algorithms presented in
this chapter are new. The ones for short and medium-length keys exploit efficient encoding schemes
for multisets, while the algorithm for long keys takes advantage of the non-word-locality of the I/O
protocol.  The  fact  that  the  resulting  VLSI  designs  are  optimal  or  near-optimal  confirms  the  inherent 
validity  of  the  lower-bound  analysis  developed  in  Chapters  3  and  4. 

Some closing remarks are finally presented in Chapter 8.


PART I


LOWER  BOUNDS 


CHAPTER  2 


PRELIMINARIES 

2.1  INTRODUCTION 

Part  I  is  devoted  to  the  study  of  lower  bounds  on  the  area-time  complexity  of  sorting.  However, 
the  techniques  that  we  develop  are  general,  and  will  probably  be  useful  to  investigate  several  other 
problems. 

We recall from Chapter 1 that the VLSI complexity of a computational problem $\Pi$ is described by
the family of functions

$A = a_n(T), \quad T \in [T_{\min}(n), T_{\max}(n)],$    (2.1)

where $n$ is the input size, $a_n(T)$ is the area of the smallest design that solves $\Pi$ in time $T$, $T_{\min}$ is the
minimum time required to solve $\Pi$ (regardless of the area), and $T_{\max}$ is a time such that, for
$T > T_{\max}$, $a_n(T)$ is constant with respect to $T$.

Area-time lower bounds can be stated in different forms. The most common are

$A = \Omega(f_1(n, T))$,    (2.2)

$T = \Omega(f_2(n, A))$,    (2.3)

$g(A, T) = \Omega(f(n))$,    (2.4)

where $f_1$, $f_2$, $g$, and $f$ are suitable functions. It is usually a simple matter to convert one of the
above forms into another. The choice of the form to be used in a specific case is only a matter of
convenience.
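
As a worked instance of such a conversion, consider a bound on the measure $g(A,T) = AT^2$ with a generic right-hand side $f(n)$ (the choice of $g$ here is only an example). Solving for $A$ and for $T$ in turn gives the other two forms:

```latex
\begin{align*}
AT^2 = \Omega\bigl(f(n)\bigr)
  &\iff A = \Omega\!\left(\frac{f(n)}{T^2}\right)
  && \text{(form (2.2), with } f_1(n,T) = f(n)/T^2\text{)} \\
  &\iff T = \Omega\!\left(\sqrt{\frac{f(n)}{A}}\right)
  && \text{(form (2.3), with } f_2(n,A) = \sqrt{f(n)/A}\text{)}
\end{align*}
```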


2.1.1  Layout  Theory 


Since  a  VLSI  chip  can  be  viewed  as  the  layout  of  a  given  computation  graph,  some  useful  tools  to 
establish  area-time  lower  bounds  can  be  borrowed  from  layout  theory,  a  chapter  of  graph  theory  which 
studies,  among  other  things,  the  problem  of  determining  the  minimum  area  needed  to  embed  a  given 
graph  in  the  plane,  according  to  some  specified  layout  rules. 

Typically,  lower  bounds  (and  also  upper  bounds)  on  the  layout  area  are  given  in  terms  of  some 
auxiliary  quantities  associated  with  the  graph,  which  are  hopefully  easier  to  compute  or  to  bound  than 
the area itself. Among the most interesting auxiliary quantities proposed in the literature are the bisection
width [T80], the crossing number [L81a], the wire area [L81a], the separator [Ls80b, Va81], and the
bifurcator [L82, BL84].

When  applying  layout  theory  to  obtain  area-time  lower  bounds,  we  do  not  deal  with  a  specific 
graph, but with all the graphs that can support the computation to solve a given problem $\Pi$ in a given
time $T$. Thus, our goal is to show how this computational property of the graph implies a bound, either
directly  on  the  area,  or  on  some  related  auxiliary  quantities.  Some  techniques  have  been  proposed  in  the 
literature  to  achieve  this  goal,  and  we  briefly  review  them  in  the  next  section. 

2.1.2  Area-Time  Lower-Bound  Techniques 

To  date,  all  known  area-time  lower  bounds  belong  to  one  of  the  three  following  classes. 

(1) Input-output bounds. They are of the form

$AT = \Omega(\text{size of input} + \text{size of output})$    (2.5)

and are a trivial consequence of the fact that the area is at least proportional to the number of I/O
ports, which in turn is at least proportional to the maximum number of bits that the chip inputs or
outputs in a time unit. For boundary chips (where all the I/O ports are placed on the boundary of the
layout region), the I/O bound becomes

$pT = \Omega(\text{size of input} + \text{size of output})$    (2.6)

where $p$ is the perimeter of the layout region. Bound (2.6) is usually combined with other considerations
to obtain area-time bounds.


(2) Functional dependence bounds. Functional dependence of the output variables on the input variables
can sometimes be exploited to strengthen the I/O bound, as in [Jh80], where it has been shown
that, for the addition of binary integers with $n$ bits,

$A = \Omega\!\left(\frac{n}{T}\log\frac{n}{T}\right)$,    (2.7)

or equivalently,

$AT/\log A = \Omega(n)$.    (2.8)

The argument to establish the bound is rather subtle, and we will discuss it in detail in Section 4.3,
where we apply it to the problem of comparison exchange.


(3) Information-exchange bounds. Almost all the nontrivial known lower bounds on area-time complexity
are of the type

$AT^2 = \Omega(I^2(n))$,    (2.9)

where $I(n)$ is the bisection information of the problem $\Pi$ being considered, a very important notion
introduced by Thompson [T80]. Informally, the bisection width $b$ of a graph $G = (V,E)$ is the
minimum number of edges to be removed in order to separate a set of $|V|/2$ vertices from its complement.
(For formal definitions and generalizations see [T80], and also Section 3.1.) The bisection-information
arguments are based on two facts: (i) the layout area is at least proportional to the square
of the bisection width; (ii) any computation graph that solves a given problem $\Pi$ must support an
information exchange $I(n)$ through its bisection, where $I(n)$ is a function associated with $\Pi$. The
bound (2.9) follows easily from (i) and (ii), considering that $b \ge I(n)/T$. The evaluation of $I(n)$
requires an argument tailored to the particular problem being studied. Indeed, considerable attention
has been devoted to the subject of information exchange, which we survey briefly in the next section.
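
For very small graphs the informal definition of bisection width can be evaluated by exhaustive search. The sketch below is only an illustration of the definition: the function names are hypothetical, the search is exponential in the number of vertices, and the formal definition in [T80] may differ in details.

```python
from itertools import combinations


def bisection_width(vertices, edges):
    """Minimum number of edges crossing any split with floor(|V|/2) vertices on one side."""
    vertices = list(vertices)
    half = len(vertices) // 2
    best = None
    for side in combinations(vertices, half):
        side = set(side)
        # An edge is cut when exactly one of its endpoints lies in `side`.
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        if best is None or cut < best:
            best = cut
    return best


def mesh(side):
    """Vertices and edges of a side x side square mesh."""
    vertices = [(r, c) for r in range(side) for c in range(side)]
    edges = []
    for r, c in vertices:
        if c + 1 < side:
            edges.append(((r, c), (r, c + 1)))
        if r + 1 < side:
            edges.append(((r, c), (r + 1, c)))
    return vertices, edges


path_edges = [(0, 1), (1, 2), (2, 3)]
cycle_edges = [(i, (i + 1) % 6) for i in range(6)]

print(bisection_width(range(4), path_edges))   # 1
print(bisection_width(range(6), cycle_edges))  # 2
print(bisection_width(*mesh(4)))               # 4
```

By fact (i), a graph with bisection width $b$ needs layout area proportional to at least $b^2$; the 16-vertex mesh, with $b = 4$, is consistent with this, since its natural layout already occupies area proportional to 16.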
2.1.3 Information Exchange

In  recent  years  the  study  of  both  distributed  and  VLSI  computing  has  generated  considerable 
interest  in  the  analysis  of  the  amount  of  information  that  different  processors  have  to  exchange  when 
cooperating  in  solving  a  given  problem. 

Several quantitative definitions of the information exchange $I$ associated with a problem $\Pi$ have
been proposed, and several techniques to lower bound $I$ have been developed. The objective of this section
is to recall the main concepts at an intuitive level, and to indicate the appropriate references, where
a  more  detailed  treatment  of  the  subject  can  be  found. 

The general framework is one in which two processors $P_1$ and $P_2$ cooperate to solve a problem $\Pi$,
or equivalently to compute a function $f$. The basic question is: how many bits do the processors have
to exchange during the computation? The answer is obviously dependent on a number of assumptions,
and  different  authors  have  made  different  assumptions.  We  list  some  of  them  here. 

I/O-Variable assignment. In the simplest case the assignment of I/O variables to processors is
completely specified. In applications to VLSI we are typically interested in a class of assignments, and
the information exchange must be minimized over the class. ([Y79], [T80], [BK81], [AA80], [Y81],
[LS81], [BG82], [MS82], [Sa79], [Vu83], [JK84], [AUY83].)

Communication protocol. We may consider a one-directional link, say from $P_1$ to $P_2$ (one-way
communication), or a link for each direction (two-way communication) ([Y79]). We may also impose
bounds on how many messages can be exchanged, a message being a run of bits sent by $P_i$ to $P_j$.
Alternatively, we may bound the length of the messages, and so forth ([PS82], [DGS84]).

Type of computation. The computation performed by $P_1$ and $P_2$ can be assumed to be deterministic,
nondeterministic, or randomized (Las Vegas). ([MS82], [PS82], [LS81], [DGS84], [Y79],
[AUY83].)

Complexity measure. Finally, we can count the bits exchanged by $P_1$ and $P_2$ in the worst-case
instance, or in several kinds of average case [Km83].

The above list of assumptions is by no means exhaustive, but should give an idea of the variety of
issues which are addressed in this area of study. Typical results that can be found in the literature concern:
(i) general lower-bound techniques; (ii) bounds on the information exchange of specific functions;
(iii) the study of complexity classes related to various definitions of $I$; and (iv) conditions under which
the bound $AT^2 = \Omega(I^2)$ is valid in the VLSI model.
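
As a concrete instance of the two-processor framework, consider the most restricted setting: a fixed assignment of inputs ($P_1$ holds $x$, $P_2$ holds $y$), deterministic one-way communication from $P_1$ to $P_2$, a single fixed-length message, and worst-case counting. Under these assumptions $P_1$ only needs to tell $P_2$ which row of the table of $f$ its input selects, so the number of bits is the ceiling of the base-2 logarithm of the number of distinct rows. The sketch below computes this quantity by brute force; the function name is hypothetical.

```python
from math import ceil, log2


def one_way_bits(f, xs, ys):
    """Bits P1 must send so that P2, knowing y, can always compute f(x, y).

    Two inputs x and x' can share a message exactly when f(x, .) and f(x', .)
    agree on every y, so the distinct rows of the function table are counted.
    """
    distinct_rows = {tuple(f(x, y) for y in ys) for x in xs}
    return ceil(log2(len(distinct_rows)))


values = range(8)  # 3-bit inputs on each side

print(one_way_bits(lambda x, y: x == y, values, values))       # 3
print(one_way_bits(lambda x, y: x <= y, values, values))       # 3
print(one_way_bits(lambda x, y: (x + y) % 2, values, values))  # 1
```

Equality and comparison force $P_1$ to transmit its whole input, whereas the parity of the sum needs a single bit; richer settings (two-way links, classes of assignments, randomization) require the techniques in the references cited above.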

A  complete  account  of  the  theory  of  information  exchange  is  not  our  present  objective.  However, 
we  will  return  to  this  subject  to  propose  some  new  developments,  with  relevant  applications  to  VLSI 
complexity. 

2.1.4  Summary  of  Part  I 

The input to a sorting problem is a multiset, and for this reason efficient schemes to encode
multisets are essential to obtain good algorithms. Moreover, the efficiency of a given encoding
scheme is very sensitive to the ratio between the size of the multiset and the size of the universe from
which the elements are drawn, and this makes the nature of the sorting problem vary considerably with
the length of the keys being sorted. Thus, both lower-bound arguments and upper-bound constructions
greatly benefit from a solid understanding of the subject "encoding of multisets", which is treated in
Section 2.2.

Chapter  3  is  devoted  to  general  lower-bound  techniques.  In  Section  3.1  we  generalize  the  notion 
of  bisection  width  by  introducing  the  notion  of  dichotomy  width  of  a  graph,  a  quantity  very  useful  to 
lower  bound  the  layout  area  of  some  graphs.  In  Section  3.2  we  show  that  a  suitable  generalization  of 
the  traditional  concept  of  information  exchange  can  be  used  to  lower  bound  the  dichotomy  width  of 
computation  graphs.  When  combined  with  those  of  Section  3.1,  these  results  provide  powerful  tools  to 
lower  bound  the  area-time  complexity  of  computational  problems. 

The traditional bisection-information techniques as well as the generalization proposed in Section
3.2 capture the idea that if some variables output at a given place carry information on other variables
input at a different place, then some kind of information flow between the two places will be required
by the computation. However, there are cases where the information is input in a place close to where
it must be output, and nevertheless it must be temporarily transferred to a different place, due to the
fact that all local storage is saturated. In Section 3.3 we show how this intuition can be formalized by
defining the notion of information exchange under bounded storage. We also develop a general
technique to obtain area-time lower bounds based on this notion.

In Chapter 4 we apply the results of Chapter 3 to specific problems. In Section 4.1 we derive
lower bounds for the information exchange and the area-time complexity of cyclic shift. Although
cyclic shift is an interesting problem in its own right, our main motivation for analyzing it is the
relationship between cyclic shift and sorting, to be systematically exploited in Section 4.2 where we
finally concentrate on the sorting problem.

Section 4.2 is organized in three subsections, respectively devoted to the study of three different
ranges of key lengths. Several new lower bounds are obtained both on the $AT^2$ measure (using the
dichotomy-information technique), and on the $AT$ measure (using the saturation technique). As we
shall see, the $AT^2$ bounds dominate in fast computations, whereas the $AT$ bounds dominate in slow
computations.

Finally,  in  Section  4.3  we  discuss  the  area-time  lower  bounds  for  the  comparator-exchanger, 
which  can  be  viewed  as  a  sorter  of  two  keys.  Here  we  have  to  investigate  the  notion  of  functional 
dependence  and  its  effect  on  the  area-time  performance.  Crucial  to  this  type  of  argument  is  the  notion 
of  bounded  fan-in  digital  circuits. 

2.2  ENCODING  MULTISETS 

This section is devoted to the study of efficient encodings of multisets. We are interested in
multisets because:

(i) The input to a sorting problem is a multiset (the ordering of the elements in the input list is
immaterial).

(ii) A sorted list can be viewed as a canonical representation of the underlying multiset (two lists
of elements represent the same multiset if and only if they are identical when sorted).

The study of efficient encodings of multisets will provide us with a background both for the
information-based lower bounds of Chapter 4, and for the upper bound constructions of Chapters 6
and 7.

A  multiset  S  is  a  collection  of  elements  from  a  totally  ordered  set  U  called  the  universe,  with 
repetitions  allowed.  In  the  sequel  we  are  only  concerned  with  finite  multisets  and  finite  universes  so 
that, without loss of generality, we can use the following notation:

$$S = \{X_0, X_1, \ldots, X_{n-1}\}$$  (2.10)

$$U = \{0, 1, \ldots, r-1\}.$$  (2.11)


Thus, n is the size of the multiset, and r is the size of the universe. Usually we think of the elements
of U as encoded in binary, and we denote by $k = \lceil \log r \rceil$ the number of bits needed to encode an
element. Since the order of the elements of S is immaterial, representation (2.10) is not unique, and given
any permutation $\pi(0), \pi(1), \ldots, \pi(n-1)$ of the integers $0, 1, \ldots, n-1$, we can also write

$$S = \{X_{\pi(0)}, X_{\pi(1)}, \ldots, X_{\pi(n-1)}\}.$$

This representation becomes unique if we add the constraint that $X_{\pi(i)} \le X_{\pi(i+1)}$ for
$i = 0, 1, \ldots, n-2$, or in other words if we require that the sequence $X_{\pi(0)}, X_{\pi(1)}, \ldots, X_{\pi(n-1)}$ be
sorted in nondecreasing order. From this standpoint sorting becomes the operation of computing a
canonical representation for a multiset.

Other  representations  are  clearly  possible,  and  could  be  more  convenient  in  some  situations.  In 
particular,  in  VLSI  computation  we  are  interested  in  nonredundant  representations  because  they  require 
less  bandwidth  for  transmission. 
2.2.1 Counting Arguments

A simple combinatorial argument shows that the number of multisets of size n in a universe of
size r is $\binom{n+r-1}{n}$. Thus, the number of bits necessary to encode a multiset is

$$e(n,r) = \log \binom{n+r-1}{n}.$$  (2.12)

If  we  use  Stirling’s  approximation  for  the  factorial,  after  some  manipulations  we  can  rewrite  Eq.  (2.12) 
as 

$$e(n,r) = n \log(1 + r/n) + r \log(1 + n/r) + \text{lower order terms}.$$  (2.13)

It is interesting to consider the asymptotic behavior of $e(n,r)$ when r is an increasing function of n, as in
the following examples.

(1) $r/n \to 0$: $e(n,r) \approx r(\log n - \log r)$
    ($r = r_0 = \text{constant}$: $e(n,r_0) \approx r_0 \log n$)

(2) $r = n \times \text{constant}$: $e(n,r) = \Theta(n)$
    ($r = n$: $e(n,n) \approx 2n$)

(3) $r/n \to \infty$: $e(n,r) \approx n(\log r - \log n)$
    ($n = n_0 = \text{constant}$: $e(n_0,r) \approx n_0 \log r$)

Certainly there are encodings of multisets that use exactly $e(n,r)$ bits. However, we are interested in
encodings that either arise naturally from problems, or that, although artificially introduced, preserve
some intuitive meaning, and are useful in multiset manipulations.
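
As a numerical illustration of Eqs. (2.12) and (2.13), the following Python sketch compares the exact bit count with the leading terms of the Stirling approximation. The function names `exact_bits` and `approx_bits` are introduced here for illustration only, and logarithms are taken in base 2.

```python
import math

def exact_bits(n, r):
    """log2 of the number of multisets of size n over a universe of size r (Eq. 2.12)."""
    return math.log2(math.comb(n + r - 1, n))

def approx_bits(n, r):
    """Leading terms of Eq. (2.13); the lower order terms are dropped."""
    return n * math.log2(1 + r / n) + r * math.log2(1 + n / r)

if __name__ == "__main__":
    # One sample point for each of the three regimes r << n, r = n, r >> n.
    for n, r in [(1024, 4), (1024, 1024), (4, 1024)]:
        print(n, r, round(exact_bits(n, r), 1), round(approx_bits(n, r), 1))
```

For r = n the approximation evaluates to 2n, in agreement with case (2) above; the exact count is smaller by lower order terms only.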

2.2.2  List  Encoding 

The most natural way to describe a multiset consists in giving a list of its elements, in any order.
Clearly $e_{list}(n,r) = n \lceil \log r \rceil = nk$ bits are used for this representation. Thus, the list encoding is optimal
(in the order) if and only if the universe is large enough, namely if

$$k = \log n + \Omega(\log n)$$  (2.14)

or, equivalently, if $r \ge n^{1+\alpha}$ for some $\alpha > 0$. The list encoding becomes very inefficient for a
small universe. In the extreme case of $r = 2$, $e_{list}(n,2) = n$, whereas $e(n,2) \approx \log n$. Therefore we
turn our attention to another method.


2.2.3 Multiplicity Encoding

Another simple way to specify a multiset is to say how many occurrences it contains of any given
element of the universe. Formally, we introduce the multiplicity function $\mu(i)$ $(i = 0, 1, \ldots, r-1)$
of multiset S defined as

$$\mu(i) = \text{number of occurrences of element } i \text{ in multiset } S.$$  (2.15)

Since $\mu(i)$ is at most n, $\mu(i)$ can be represented with $\lceil \log(n+1) \rceil \le \log n + 1$ bits, and hence S can be
encoded with $e_{mult}(n,r) = r(\log n + 1)$ bits. This encoding is optimal in the order when

$$k = \log n - \Omega(\log n)$$  (2.16)

or, equivalently, if $r \le n^{1-\alpha}$ for some $\alpha > 0$. Slightly better results can be obtained by using a
variable length encoding for $\mu(i)$. For example we can encode integer h with $2 \lceil \log(h+1) \rceil$ bits by
using the empty string for $h = 0$. The multiplicity function can then be represented by the list
$\mu(0), \mu(1), \ldots, \mu(r-1)$ with the commas encoded as '01'. Thus we can use a total number of bits

$$e'_{mult}(n,r) = \sum_{i=0}^{r-1} 2 \lceil \log(\mu(i)+1) \rceil + 2(r-1).$$

It is easy to see that, under the constraint $\mu(0) + \mu(1) + \ldots + \mu(r-1) = n$,

$$e'_{mult}(n,r) = O(r \log(n/r + 1) + r)$$

which is optimal for $r \le n$. For $r > n$, we must resort to different techniques.
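
The variable length scheme can be sketched in Python as follows. The text fixes only the code length of each integer and the comma; this sketch assumes, as one possible realization, that each binary digit of $\mu(i)$ is doubled ('00' or '11') so that the pair '01' can serve unambiguously as the comma. The names `encode_int`, `mult_encode` and `mult_decode` are hypothetical.

```python
from collections import Counter

def encode_int(h):
    """Encode h >= 0 with 2*ceil(log2(h+1)) bits by doubling each binary digit.

    The empty string encodes 0. Digit doubling is an assumption of this sketch.
    """
    return "".join(b * 2 for b in bin(h)[2:]) if h > 0 else ""

def mult_encode(S, r):
    """Multiplicity encoding of multiset S over the universe {0, ..., r-1}."""
    mu = Counter(S)
    return "01".join(encode_int(mu[i]) for i in range(r))

def mult_decode(code, r):
    """Recover the sorted list of elements of S from its multiplicity encoding."""
    counts, current = [], ""
    for pos in range(0, len(code), 2):
        pair = code[pos:pos + 2]
        if pair == "01":                      # comma: close the current integer
            counts.append(int(current, 2) if current else 0)
            current = ""
        else:                                 # "00" or "11": one binary digit
            current += pair[0]
    counts.append(int(current, 2) if current else 0)
    assert len(counts) == r
    return [i for i, c in enumerate(counts) for _ in range(c)]
```

The length of the string returned by `mult_encode` equals the sum given above for $e'_{mult}(n,r)$, including the $2(r-1)$ bits spent on commas.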




2.2.4 The Insert-and-Prune Encoding

In  this  section  we  propose  a  new  encoding  for  multisets  which  is  based  on  a  sorting  method  and  is 
not  as  natural  as  the  list  and  the  multiplicity  schemes,  but  it  is  simple  and  elegant.  Moreover  it  can  be 
effectively  used  in  some  sorting  algorithms. 

Let us begin with a simple observation. In a sorted sequence of n elements of k bits each, the
sequence of bits in the most significant position is a run of zeros followed by a run of ones. Therefore it
can be completely described by specifying how many zeros there are, which only requires $\log n$ bits
instead of the n bits taken in the list representation. In general, in a sorted sequence the j-th most
significant position (from the left) contains at most $2^j$ alternating runs of zeros or ones. Thus, for
$j < \log n$ not all the binary sequences of n bits are candidates to be the j-th position of a sorted
sequence, and therefore less than n bits are needed to encode that position.

We  could  try  to  exploit  systematically  the  above  observations  and  build  an  efficient  encoding 
based  on  the  length  of  runs  of  identical  bits  in  each  bit  position  of  the  sequence,  but  the  resulting 
scheme  would  be  rather  awkward  and  difficult  to  manipulate.  However,  the  above  discussion  reveals 
an  important  property:  the  leftmost  bit  positions  in  a  sorted  sequence  carry  less  information  than  the 
number  of  bits  devoted  to  these  positions  in  the  list.  As  it  turns  out,  if  we  have  some  extra  knowledge 
about  our  sorted  sequence,  we  may  even  completely  reconstruct  the  sequence  by  looking  only  at  its 
least  significant  position!  This  is  a  consequence  of  the  following  result. 

Theorem 2.1. If $S = \{X_0, \ldots, X_{n-1}\}$ is a multiset drawn from the universe $U = \{0, 1, \ldots, r-1\}$,
and T is the sorted list of the union of S and U, then there is a one-to-one correspondence between S
and the sequence of bits in the least significant position of T.

Proof. T is the concatenation of r subsequences, the i-th of which consists of $\mu(i) + 1$ copies of
element i $(i = 0, \ldots, r-1)$, where $\mu(i)$ is the multiplicity function of S. The situation is illustrated in
Figure 2.1. The least significant bits of T are the concatenation of r sequences, the i-th of which consists of
$\mu(i) + 1$ identical bits each equal to i modulo 2. Thus, from the least significant bits we can recover
the multiplicity encoding of S, and S itself. The converse is obvious. □

Figure 2.1. The structure of sequence T. (Consecutive blocks of lengths $\mu(0)+1$, $\mu(1)+1$, ..., $\mu(r-1)+1$.)
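
The correspondence of Theorem 2.1 can be checked on small instances with the following Python sketch. The function names `lsb_encode` and `lsb_decode` are introduced here for illustration; decoding relies only on the run structure described in the proof.

```python
def lsb_encode(S, r):
    """Least significant bits of T, the sorted union of S and U = {0, ..., r-1}."""
    T = sorted(list(S) + list(range(r)))
    return [x & 1 for x in T]

def lsb_decode(bits):
    """Recover the sorted elements of S from the least significant bits of T.

    The i-th run of identical bits has length mu(i) + 1, so every repeated bit
    inside a run stands for one occurrence of the current value in S.
    """
    S, value = [], 0
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            S.append(value)      # one more copy beyond the copy contributed by U
        else:
            value += 1           # a bit flip marks the next element of the universe
    return S
```

The encoding has length n + r, one bit for each element of T.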

Remark. It follows from Theorem 2.1 that the sequence of bits in the least significant position of T is a
valid encoding of S, requiring $n + r$ bits. For $r = n$ the encoding is optimal up to lower order
terms. For $r \gg n$ or $r \ll n$ the encoding is highly inefficient. However, for $r > n$, the following
generalization of Theorem 2.1 yields a better result.

Theorem 2.2. Let for simplicity $n = 2^\nu$, $r = 2^k$, and $s = 2^\sigma$ be powers of two. Let also
$S = \{X_0, \ldots, X_{n-1}\}$ be a multiset from the universe $U = \{0, \ldots, r-1\}$, and
$U(s) = \{0, s, 2s, \ldots, r-s\}$ be a sampling of U with period s. Define T as the sorted list of the union
of S and $U(s)$. Then, there is a one-to-one correspondence between S and the sequence formed by the
$\sigma + 1$ least significant bits of the elements of T.

Proof. We introduce the notation

$A'$ = multiset of the prefixes of length $k - \sigma$ of the elements in multiset A

and we define $U(s)'$ and $T'$ accordingly. Clearly $U(s)' = \{0, 1, \ldots, r'-1\}$ where $r' = r/s$. Thus
we can apply Theorem 2.1 to multiset $S'$ and universe $U(s)'$, to reconstruct $T'$ from the $(k - \sigma)$-th
most significant bit position of T. Then we easily reconstruct the entire T by concatenating most and
least significant positions of each element. Finally we obtain $S = T - U(s)$. □


Remark. Theorem 2.2 reduces to Theorem 2.1 when $\sigma = 0$.


We call insert-and-prune encoding the representation of S obtained by augmenting S with $U(s)$ and
sorting the result (therefore effectively inserting the sorted $U(s)$ into the sorted S), and by subsequently
removing (pruning) the $k - \sigma - 1$ most significant bits of each element. The number of bits required
by this encoding is

$$e_{isp}(n,r) = (\sigma + 1)(n + 2^{-\sigma} r).$$  (2.17)

For $r < n$, the choice $\sigma = 0$ minimizes $e_{isp}(n,r)$, giving $e_{isp}(n,r) = n + r$, and the encoding is not
optimal. For $r \ge n$, the choice $\sigma = \log(r/n)$ yields an optimal encoding with

$$e_{isp}(n,r) = O(n(\log(r/n) + 1)).$$  (2.18)

Summary of insert-and-prune encoding. If S is a multiset of n elements of $k = \log n + h$ bits, we can
encode it with $e_{isp} = (h + 1) 2n$ bits by the following procedure.

1. Add to S the n elements $\{2^h i : i = 0, 1, \ldots, n-1\}$.

2. Sort the resulting multiset.

3. Retain only the bits in the $(h + 1)$ least significant positions.

A picture of the encoding scheme is given in Figure 2.2.
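
The three steps above, together with the decoding argument in the proof of Theorem 2.2, can be sketched in Python as follows. The names `isp_encode` and `isp_decode` and the representation of the code as a list of small integers are choices made for this sketch only; `sigma` plays the role of h in the summary.

```python
from collections import Counter

def isp_encode(S, k, sigma):
    """Return the (sigma + 1) low-order bits of each element of sorted(S + U(s))."""
    s = 1 << sigma                          # sampling period s = 2**sigma
    samples = range(0, 1 << k, s)           # U(s) = {0, s, 2s, ..., r - s}
    T = sorted(list(S) + list(samples))
    mask = (1 << (sigma + 1)) - 1
    return [x & mask for x in T]

def isp_decode(code, k, sigma):
    """Rebuild T from the pruned words, then remove U(s) to recover sorted S."""
    s = 1 << sigma
    T, prefix, prev_bit = [], 0, 0          # T starts with sample 0, whose prefix is 0
    for word in code:
        bit = word >> sigma                 # least significant bit of the pruned prefix
        if bit != prev_bit:                 # a flip marks the next prefix value
            prefix += 1
            prev_bit = bit
        T.append((prefix << sigma) | (word & (s - 1)))
    remaining = Counter(T) - Counter(range(0, 1 << k, s))
    return sorted(remaining.elements())
```

With n = 8 keys of k = 4 bits and sigma = 1, the code consists of 16 words of 2 bits each, that is $(h + 1) 2n = 32$ bits.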


Figure 2.2. The insert-and-prune (isp) encoding scheme. (Labels in the original figure: redundant bits; 2n elements.)

2.2.5 Two-Stage Encoding

Given a multiset S with a multiplicity function $\mu(i)$, $i = 0, \ldots, r-1$, we define the distribution
function

$$M(i) = \sum_{j \le i} \mu(j), \qquad i = 0, 1, \ldots, r-1.$$  (2.19)

Obviously $(M(0), M(1), \ldots, M(r-1))$ is a sorted sequence with all elements less than or equal to n. If
$r \le n$, we can then encode this sequence by the insert-and-prune method. Since the size of the
sequence is r, and the size of the elements is $\log n = \log r - 1 + (\log n - \log r + 1)$, the two-stage encoding of
S uses a number of bits

$$2r(\log n - \log r + 1) = O(r \log(n/r + 1)),$$  (2.20)

which is optimal.
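
The first stage of this construction, the passage from the multiplicity function to the distribution function and back, can be sketched in Python as follows. The names `distribution` and `multiplicities` are hypothetical; the second stage would apply a sorted-sequence encoder such as insert-and-prune to the list returned by `distribution`.

```python
from collections import Counter
from itertools import accumulate

def distribution(S, r):
    """Distribution function M(i) = mu(0) + ... + mu(i) of Eq. (2.19)."""
    mu = Counter(S)
    return list(accumulate(mu[i] for i in range(r)))

def multiplicities(M):
    """Invert Eq. (2.19): mu(i) = M(i) - M(i-1), with M(-1) taken as 0."""
    return [b - a for a, b in zip([0] + M, M)]
```

Since the multiplicities are nonnegative, the list returned by `distribution` is nondecreasing and its last entry equals n.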


2.2.6  Summary  of  Optimal  Encodings 

We summarize the encodings described in the previous sections in Figure 2.3, where we show the
ranges of k in which each of the encodings is optimal. We recall that the results are of an asymptotic
nature, and are based on the assumption that k increases with n.

For completeness, we report here that a multiset $S = \{X_0, \ldots, X_{n-1}\}$ can be represented by
specifying the difference between consecutive elements in the sorted arrangement of S. This encoding is
efficient when $r \approx n$, and has been successfully exploited in [Lo83] to obtain optimal sorting
algorithms on a distributed system.


Figure 2.3. Ranges of optimality of encoding schemes for multisets. (Ranges of k marked in the original figure: $k = \log n - \Omega(\log n)$; $k = \log n - o(\log n)$; $k = \log n + o(\log n)$; $k = \log n + \Omega(\log n)$.)


CHAPTER  3 


LOWER-BOUND  TECHNIQUES 


The lower-bound techniques of this chapter use a combination of a geometric argument, based on
a suitable subdivision of the layout region, and an information-theoretic argument, based on the
information exchange between a region of the geometric subdivision and the remaining part of the layout.

Two  basic  methods  to  subdivide  the  layout  region  will  be  considered. 

(i) Bipartition. It is the classical method introduced by [T80], whereby the subdivision is obtained
by cutting the layout into two regions separated by a straight line (or a simple deformation thereof).

(ii) Square tessellation. It is a method that we shall introduce in the next section, and consists in
subdividing the layout region into a mesh of square cells all of the same size.

We  shall  also  make  use  of  two  basic  information-theoretic  notions. 

(a) Information exchange. It is the classical notion studied by several authors, as briefly reported
in Section 2.1.3, and will be formally defined in Section 3.2.

(b) Bounded-storage information-exchange. It will be formally defined in Section 3.3 as a
refinement of (a) when a bounded storage is assumed for the processors that execute the computation,
and is instrumental to study information-exchange in saturation conditions.

When classified with respect to the geometric and the information-theoretic notions of which they
make use, the lower-bound techniques can be of one of the four types: (i)-(a), (i)-(b), (ii)-(a), (ii)-(b).

As we shall see, types (i)-(a) and (ii)-(a) yield lower bounds on the $AT^2$ measure, and type (ii)-(b)
yields lower bounds on the $AT$ measure. Presently, we do not know of any useful application of
technique (i)-(b).


In all the applications where we shall make use of the square-tessellation technique, although we
subdivide the layout into many regions, we only need to consider the information exchange occurring
between one region and the rest of the layout.

Thus, in both the bipartition and the square tessellation techniques, we are effectively studying
the information exchange that occurs between a set of nodes of the computation graph, and its
complement. We refer to a partition of the vertex set of a graph into two sets as a dichotomy of the graph.
As we shall see, dichotomies, and the related notion of dichotomy width (to be formally defined in
Section 3.1), play a relevant role in lower-bound theory.

To avoid terminological confusion, we stress the point that dichotomy is a topological notion
pertaining to a graph, while bipartition is a geometric notion pertaining to a layout (of a graph). The two
concepts should be kept distinct, although any bipartition of the layout induces a dichotomy of the
graph.

We  shall  use  the  term  bisection  only  in  a  topological  denotation,  to  refer  to  a  dichotomy  which  is 
(roughly)  balanced  with  respect  to  a  given  weight  of  the  vertices  of  the  graph.  This  is  in  agreement 
with  the  original  definition  given  in  [T80].  Instead,  we  shall  not  use  the  term  bisection  to  denote  a 
geometric  cut  of  the  layout,  even  if  it  induces  a  bisection  in  the  corresponding  graph. 

3.1  THE  DICHOTOMY  LOWER  BOUND  ON  THE  LAYOUT  AREA 

In  this  section  we  present  a  new  technique  to  obtain  lower  bounds  on  the  layout  area  of  graphs. 
The  technique  is  based  on  the  notion  of  dichotomy  which  generalizes  the  notion  of  bisection. 

Given a graph $G = (V, E)$ we call dichotomy a partition $D = (V_1, V_2)$ of the vertex set V, and we
denote by $\delta(D)$ the number of edges of G that connect $V_1$ to $V_2$. We define the dichotomy width with
respect to a class $\Gamma$ of dichotomies of G as the minimum number of edges that have to be removed in
order to disconnect $V_1$ from $V_2$, over all dichotomies in $\Gamma$. Formally we have the following definition.

Definition 3.1. Given a graph $G = (V, E)$, and a class $\Gamma$ of dichotomies of G, the $\Gamma$-dichotomy width is
defined as

$$\delta_\Gamma = \min_{D \in \Gamma} \delta(D).$$  (3.1)

Remark. If $\Gamma = \{D : |V_1| = \lfloor N/2 \rfloor\}$, where $N = |V|$, then $\delta_\Gamma$ becomes the minimum bisection width as
defined by Thompson [T80].

In the sequel we consider some choices of $\Gamma$ that enable us to prove a lower bound on the layout
area in terms of $\delta_\Gamma$. We begin with the simple case in which $\Gamma$ is the class of all dichotomies $(V_1, V_2)$
with $|V_1| = m$ $(m \le N = |V|)$:

$$\Gamma_m = \{(V_1, V_2) : |V_1| = m\}.$$  (3.2)

To simplify the notation, we write $\delta(m)$ for $\delta_{\Gamma_m}$, the dichotomy width of G with respect to $\Gamma_m$.
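
For very small graphs, the dichotomy width with respect to the class of all dichotomies with m nodes on one side can be computed directly from the definition. The Python sketch below enumerates all vertex subsets of size m, so it is exponential in the number of nodes and is meant only to illustrate the definition; the function name and the edge-list representation are assumptions of this sketch.

```python
from itertools import combinations

def dichotomy_width(num_nodes, edges, m):
    """Minimum number of edges between V1 and V2 over all dichotomies with |V1| = m.

    Nodes are numbered 0 .. num_nodes - 1 and edges is a list of (u, v) pairs.
    """
    def cut_size(V1):
        # An edge crosses the dichotomy when exactly one endpoint lies in V1.
        return sum((u in V1) != (v in V1) for u, v in edges)

    return min(cut_size(set(subset)) for subset in combinations(range(num_nodes), m))
```

For instance, on a cycle every admissible m with 0 < m < N gives width 2, while on the complete graph with four nodes the width for m = 2 is 4.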

We discuss now some concepts that are useful in relating the layout area A of a graph to its
dichotomy width $\delta(m)$. A graph is to be laid out on the layout grid, a plane grid the vertices of which
have integer coordinates in a suitable cartesian frame of reference. A layout of a graph is an
assignment of nodes to vertices of the grid, and of edges to paths of grid edges, where different edges of G
share only grid vertices. This restriction implies that all nodes have degree at most four, a property we
shall always assume when discussing layouts of graphs.

Besides the layout grid, it is convenient to consider another grid, the auxiliary grid, the vertices of
which are the points of semi-integer coordinates, as shown in Figure 3.1.

The area of a given layout is defined to be the area of its smallest enclosing rectangle with
boundary on the auxiliary grid. The layout area A of a graph G is the area of its smallest layout. A
zig-zag line is either a straight line on the auxiliary grid, or a pattern of the kind shown in Figure 3.2.
Formally, a vertical zig-zag line is a set of the form

$$\{(x_0, y) : -\infty < y \le y_0\} \cup \{(x, y_0) : x_0 \le x \le x_0 + a\} \cup \{(x_0 + a, y) : y_0 \le y < \infty\}$$

where $a \in \{0, 1\}$. A horizontal zig-zag line could be defined similarly.


Figure 3.1. The layout grid (solid) and the auxiliary grid (dotted).

Figure 3.2. A (vertical) zig-zag line.


The next theorem states the first lower bound to A in terms of $\delta(m)$. The result generalizes the
bound $A \ge (\delta(N/2) - 1)^2$ obtained by Thompson [T80], and the proof is based on the same technique
introduced by [T80], which we call a bipartition technique because it is based on a suitable partition of
the layout into two regions.

Theorem 3.1. If a graph G has dichotomy width $\delta(m)$, then the sides $l_x$ and $l_y$ of the smallest
enclosing rectangle R of any layout of G have length at least $\delta(m) - 1$, whence

$$A \ge (\delta(m) - 1)^2 = \Omega(\delta^2(m)).$$  (3.3)

Proof. It is easy to show that there exists a vertical zig-zag line which splits R into two regions
(separated by either $l_y$ or $l_y + 1$ grid segments), one containing m nodes of G, and the other $N - m$. By
the definition of $\Gamma_m$-dichotomy width at least $\delta(m)$ edges cross the boundary between the two regions,
and therefore $l_y + 1 \ge \delta(m)$ or $l_y \ge \delta(m) - 1$. □

The next theorem provides another bound on A in terms of $\delta(m)$. The bound is better than (3.3)
whenever $m = o(N)$. The proof introduces a novel technique, which we call the square tessellation
technique, because it is based on a partition of the layout region into a mesh of square cells, all of the
same size.

Theorem 3.2. For every graph $G = (V, E)$, and every $m \le N$,

$$A = \Omega\left(\frac{N}{m} \delta^2(m)\right).$$
Proof. Given a layout of G (on the unit grid), let R be the smallest enclosing rectangle (on the
auxiliary grid). Let us consider on the auxiliary grid a mesh of square cells with sides of length
$l = \lfloor (\delta(m) - 1)/4 \rfloor$, and such that one cell has a vertex overlapping with the southwest corner of R (see
Figure 3.3).

We claim that no cell of the mesh contains m or more nodes of G. In fact if a cell contains m or
more nodes then we can find a zig-zag line that cuts the cell into two polygons one of which, called P,
contains exactly m nodes (see Figure 3.4). This polygon has a perimeter
$p \le 4l = 4 \lfloor (\delta(m) - 1)/4 \rfloor \le \delta(m) - 1$, so that less than $\delta(m)$ edges can cross it, contradicting the
definition of $\delta(m)$. Since every cell therefore contains fewer than m nodes, more than $N/m$ cells are
nonempty; denoting by $A_c$ their total area, we have $A_c \ge (N/m)\, l^2$.

Figure 3.4. A cell with m nodes or more.

Due to some nonempty cells which have only a partial overlap with R, the layout area A can be
smaller than $A_c$. However these cells can occur only at the boundary of R. Since Theorem 3.1 ensures
that the length of each side of R is at least four times the length l of the side of the cell, R contains at
least 16 cells, so that it is easy to show that $A \ge (16/25) A_c$. Thus,

$$A \ge \frac{16}{25} A_c \ge \frac{16}{25} \frac{N}{m} \left\lfloor \frac{\delta(m) - 1}{4} \right\rfloor^2$$  (3.6)

and the theorem is proved. □

Remark. Equation 3.6 yields a better bound than Equation 3.3 for $m < N/25$.
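
The crossover stated in this remark can be checked numerically. The Python sketch below evaluates the right-hand sides of Equations 3.3 and 3.6 for given values of N, m and the dichotomy width; the function names are hypothetical, the value of the dichotomy width is simply supplied as an input, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def bipartition_bound(delta):
    """Right-hand side of Equation 3.3: (delta - 1)**2."""
    return (delta - 1) ** 2

def tessellation_bound(N, m, delta):
    """Right-hand side of Equation 3.6: (16/25) * (N/m) * floor((delta - 1)/4)**2."""
    cell_side = (delta - 1) // 4
    return Fraction(16, 25) * Fraction(N, m) * cell_side ** 2
```

When the dichotomy width minus one is a multiple of 4, the two bounds coincide exactly at m = N/25, and the tessellation bound is the larger one for smaller m.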

In general the best bound that we can obtain for the area of a given graph G from Theorem 3.2
corresponds to the value $m_0$ of m that maximizes the function $\delta^2(m)/m$. For most of the
computation graphs considered in the area-time literature $m_0 = N/2$ (or more in general $m_0 = \Theta(N)$),
and Theorem 3.1 is sufficient to obtain good lower bounds. This fact accounts for the success of
bipartition techniques, and has led researchers to focus almost exclusively on balanced partitions of
computation graphs. However, the computation graphs that solve some important problems, including sorting,
have a $\delta(m)$ function whose maximum is achieved for values of m considerably smaller than N. In
these cases, the notion of dichotomy and the square tessellation technique developed in the present
section are instrumental to obtain tight bounds.

When applying dichotomy arguments to computation graphs, we often need to consider a class $\Gamma$
more general than $\Gamma_m$. For example we may focus on the set U of the nodes that are input ports, and
we may want $\delta_\Gamma$ to represent the minimum number of edges to be removed from G in order to
disconnect a set $V_1$ containing m input ports from its complement $V_2$. In this case the appropriate definition
for $\Gamma$ is

$$\Gamma = \{(V_1, V_2) : |V_1 \cap U| = m\}.$$  (3.7)

Of course, if $U = V$, we obtain again $\Gamma_m$. We can take one more step toward generality and consider a
graph G in which each vertex v has a weight $m(v)$. For example $m(v)$ could be the number of input bits
read by node v during the computation. Then we may set

$$\Gamma = \{(V_1, V_2) : \sum_{v \in V_1} m(v) = m\}.$$  (3.8)

Obviously 3.8 reduces to 3.7 when $m(v) = 1$ for $v \in U$, and $m(v) = 0$ for $v \notin U$. When dealing
with a weighted graph it is more useful to include in $\Gamma$ all dichotomies $(V_1, V_2)$ such that $V_1$ has
global weight in a given interval $[m_1, m_2]$. In fact we can state the following result.

Theorem 3.3. Let $G = (V, E)$ be a graph where each node v has a nonnegative integer weight $m(v)$. Let
$M = \sum_{v \in V} m(v)$, and let $m(v) \le m_2 - m_1 + 1$, for any v. If we define

$$\Gamma = \{(V_1, V_2) : m_1 \le \sum_{v \in V_1} m(v) \le m_2\}$$

then we have

$$A = \Omega\left(\frac{M}{m_2} \delta_\Gamma^2\right).$$  (3.10)

Proof. The argument is the same as the one made to prove Theorem 3.2, except that the claim made on
the square cell of side $l = \lfloor (\delta_\Gamma - 1)/4 \rfloor$ will now state that the global weight of the nodes inside the cell
is less than $m_2$. The bound $m(v) \le m_2 - m_1 + 1$ ensures that if a cell has global weight $m_2$ or more,
then it is always possible to construct a zig-zag cut delimiting a polygon with a perimeter $p \le \delta_\Gamma - 1$,
which includes a set of nodes with global weight in the interval $[m_1, m_2]$. Thus, less than $\delta_\Gamma$ edges
connect the set $V_1$ of the nodes inside the polygon to the set $V_2$ of the nodes outside the polygon,
contradicting the fact that $(V_1, V_2) \in \Gamma$. □

Remark. Theorem 3.2 is a special case of Theorem 3.3, and is obtained by setting $m_1 = m_2 = m$, and
$m(v) = 1$ for all v's.

3.2  INFORMATION  EXCHANGE  AREA-TIME  LOWER  BOUNDS  (AT2  THEORY) 

In this section we introduce the notion of information exchange for a computational problem $\Pi$,
and we relate it to the dichotomy width and to the $AT^2$ measure of computation graphs for $\Pi$.




Information Exchange. Let $P_1$ and $P_2$ be two processors cooperating to solve problem $\Pi$. Let Y be
the set of input and output variables of $\Pi$, each of which is assumed to be binary. We call I/O
assignment a partition $\eta = (Y_1, Y_2)$ of Y, where $Y_s$ is the set of variables that have to be input or output by
processor $P_s$ $(s = 1, 2)$. We define the information exchange of $\Pi$ under assignment $\eta$ as

$I(\eta)$ = the minimum over all the algorithms (that solve $\Pi$ under the variable
assignment $\eta$) of the maximum over all the problem instances of the
number of bits exchanged between $P_1$ and $P_2$.

In other words, for any algorithm that solves $\Pi$ under $\eta$ there is at least a problem instance for which
$P_1$ and $P_2$ exchange $I(\eta)$ or more bits, and no integer larger than $I(\eta)$ enjoys the same property.

We also define the information exchange for a class H of assignments as

$$I_H = \min_{\eta \in H} I(\eta).$$  (3.12)

Information and Dichotomy. Given a computation graph $G = (V, E)$ and a dichotomy $D = (V_1, V_2)$ of its nodes, we can identify $P_s$ with the subgraph of $G$ on vertex set $V_s$ ($s = 1, 2$). This choice of $P_1$ and $P_2$ defines in a natural way an I/O assignment $\eta(D) = (\mathcal{V}_1, \mathcal{V}_2)$, where $\mathcal{V}_s$ is the set of variables input or output by nodes in $V_s$ ($s = 1, 2$). We are then able to relate the notion of dichotomy width to that of information exchange.

Theorem 3.4. Let $H$ be a class of I/O assignments for problem $\Pi$, with information exchange $I_H$. Let $G = (V, E)$ be a computation graph that solves $\Pi$ in time $T$, and let $\delta_\Gamma$ be the dichotomy width of $\Gamma = \{D : \eta(D) \in H\}$. Then,

    $\delta_\Gamma \ge I_H / T$.    (3.13)

Proof. If $D = (V_1, V_2) \in \Gamma$, then $\eta(D) \in H$, and $I(\eta(D)) \ge I_H$. Thus, $V_1$ must be able to exchange $I_H$ bits with $V_2$ in time $T$, and therefore must be connected to $V_2$ by at least $I_H / T$ edges. Hence, for each $D \in \Gamma$, $\delta(D) \ge I_H / T$, and $\delta_\Gamma = \min\{\delta(D) : D \in \Gamma\} \ge I_H / T$. □

$AT^2$ measure. We are now ready to state a result of major importance for the $AT^2$ theory.

Theorem 3.5. Given a computation graph $G$ for problem $\Pi$, if the class $\Gamma = \{D : \eta(D) \in H\}$ generated by a class $H$ of I/O assignments satisfies the conditions of Theorem 3.3, for a suitable choice of $m_1$, $m_2$ and of the weighting function $m(v)$, then the following lower bound holds on the area-time performance of $G$:

    $AT^2 = \Omega\left(\frac{M}{m_2} I_H^2\right)$.    (3.14)

Proof. It suffices to combine 3.13 and 3.10. □


The $AT^2$ lower bound 3.14 is a far-reaching result because for many interesting computational problems we are able to (i) find a class $H$ to which Theorem 3.5 is applicable, and (ii) compute or bound the information exchange $I_H$.

A Format for $H$. In several applications it is convenient to focus on a suitable set $\mathcal{U}$ of I/O variables, and to define $H$ as

    $H = \{\eta = (\mathcal{V}_1, \mathcal{V}_2) : m_1 \le |\mathcal{V}_1 \cap \mathcal{U}| \le m_2\}$.    (3.15)

If we let $m(v)$ be the number of variables in $\mathcal{U}$ that are input or output by node $v$ during the computation, then the class $\Gamma$ of dichotomies associated to $H$ is

    $\Gamma = \{(V_1, V_2) : m_1 \le \sum_{v \in V_1} m(v) \le m_2\}$,    (3.16)

and, if $m(v) \le m_2 - m_1 + 1$, all the conditions are satisfied for the validity of the bound $AT^2 = \Omega((M/m_2) I_H^2)$, where $M$ is the total number of variables in $\mathcal{U}$. Thus, classes of I/O assignments of the kind specified by 3.15 are good candidates when studying the area-time complexity by means of Theorem 3.5. For this reason we further investigate the nature of $I_H$.

Some Properties of $I_H$. An interesting case of class $H$ is obtained from 3.16 when $m_1 = m_2 = m$:

    $H_m = \{\eta = (\mathcal{V}_1, \mathcal{V}_2) : |\mathcal{V}_1 \cap \mathcal{U}| = m\}$.    (3.17)

In fact the classes $H_m$ enable us to decompose $H$ as

    $H = \bigcup_{m = m_1}^{m_2} H_m$,    (3.18)

and, if we denote by $I(m)$ the information exchange of $H_m$, we can write

    $I_H = \min\{I(m_1), \ldots, I(m_2)\}$.    (3.19)

A simple, but useful, observation is that, for any $m = 1, 2, \ldots, n$, we have

    $|I(m) - I(m-1)| \le 1$.    (3.20)

In fact, by just sending a bit from $P_1$ to $P_2$ [from $P_2$ to $P_1$], we can always transform an assignment in $H_m$ [$H_{m-1}$] into one in $H_{m-1}$ [$H_m$]. Using the fact that $I(0) = 0$ as the base, and Equation 3.20 as the inductive step of an inductive reasoning, we easily prove that $I(m) \le m$. Another interesting consequence of 3.20 is that

    $I(m_2) - I_H \le m_2 - m_1$,    (3.21)

a result that may simplify the derivation of bounds on $I_H$.

A Refinement of the $AT^2$ Bound. The fact that $I_H$ is related to $I(m_2)$ by 3.21 suggests the possibility of obtaining a bound on $AT^2$ directly in terms of $I(m_2)$, a quantity easier to handle than $I_H$. Such a bound is indeed provided by the next theorem. As we shall see from the proof, the result is not trivial, and requires the combination of several arguments.

Theorem 3.6. Let $G$ be a computation graph for problem $\Pi$. Let $\mathcal{U}$ be a set of I/O (binary) variables of $\Pi$, of cardinality $M$. If $H_m$ is the class of the assignments such that exactly $m$ variables of $\mathcal{U}$ are assigned to $P_1$, and $I(m)$ is the information exchange of $H_m$, then there exists a constant $\lambda$ such that

    $AT^2 \ge \lambda M I^2(m)/m = \Omega(M I^2(m)/m)$.    (3.22)

Proof. Since a node $v$ can read at most one (binary) variable per unit of time, $m(v) \le T$, and condition $m(v) \le m_2 - m_1 + 1$ is ensured by the choice $m_2 = m$, $m_1 = m - T$ in the definition 3.16 of $H$.

With this choice, relations 3.13 and 3.21 imply that

    $I_H \ge I(m) - T$,

and Theorem 3.5 (whose hypotheses are all satisfied) yields the bound

    $AT^2 \ge \lambda_1 M (I(m) - T)^2 / m$,    (3.23)

for some constant $\lambda_1$. (If we retrace the proof of 3.23 we can see that $\lambda_1 = 1/25$ will do.)

When $T$ approaches $I(m)$ from below, bound 3.23 may become weak, but because $T$ is large we expect $AT^2$ to remain large. In fact

    $AT \ge$ (number of variables to be input or output by $G$) $\ge M$,

and we have another bound

    $AT^2 \ge MT$.    (3.24)

Combining 3.23 and 3.24 we obtain

    $AT^2 \ge \max\{\lambda_1 M (I(m) - T)^2 / m,\ MT\}$.    (3.25)

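To prove that

    $\max\{\lambda_1 M (I(m) - T)^2 / m,\ MT\} \ge \lambda_1 M I^2(m) / 2m$    (3.26)

we select for $T$ the value

    $T_0 = I^2(m) / (2 I(m) + \Delta)$,

where $\Delta = m / \lambda_1$, and we argue as follows.

(i) For $T \ge T_0$, $MT \ge \lambda_1 M I^2(m) / 2m$. In fact $I(m) \le m$, and $\lambda_1$ can be taken $\le 1/2$, so that $2 I(m) + \Delta \le 2\Delta = 2m/\lambda_1$. Thus, $T_0 \ge I^2(m) / (2m/\lambda_1)$.

(ii) For $T \le T_0$, $\lambda_1 M (I(m) - T)^2 / m \ge \lambda_1 M I^2(m) / 2m$. Since $T_0 \le I(m)$, the function $(I(m) - T)^2 / m$ in the interval $[0, T_0]$ is decreasing, and achieves its minimum at $T = T_0$. The value at the minimum is

    $(I(m) - T_0)^2 / m = \frac{I^2(m)}{m} \left(\frac{I(m) + \Delta}{2 I(m) + \Delta}\right)^2 \ge \frac{I^2(m)}{2m}$,

where the last inequality holds because $\Delta \ge 2 I(m)$, which yields the desired result.

Equation 3.26 proves the theorem. Since $\lambda = \lambda_1 / 2$, $\lambda$ can be taken to be 1/50. □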
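
The case analysis in the proof can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the default $\lambda_1 = 1/25$ is the value quoted in the proof, and the exchange bound is clamped to zero for $T > I(m)$, where 3.23 carries no information.

```python
def t0(m, info, lam1=1 / 25):
    """Approximate crossover time T0 = I^2 / (2I + Delta), with Delta = m / lam1."""
    delta = m / lam1
    return info ** 2 / (2 * info + delta)


def combined_bound(T, M, m, info, lam1=1 / 25):
    """Right-hand side of 3.25: max of the exchange bound and the I/O bound."""
    # Assumption of this sketch: beyond T = I(m) the exchange bound is clamped to zero.
    exchange = lam1 * M * max(info - T, 0.0) ** 2 / m
    io = M * T
    return max(exchange, io)


def target(M, m, info, lam1=1 / 25):
    """Right-hand side of 3.26."""
    return lam1 * M * info ** 2 / (2 * m)


def holds_on_grid(M, m, info, lam1=1 / 25, steps=2000):
    """Check inequality 3.26 for T on a uniform grid over [0, 2 * I(m)]."""
    rhs = target(M, m, info, lam1)
    return all(
        combined_bound(2 * info * k / steps, M, m, info, lam1) >= rhs
        for k in range(steps + 1)
    )
```

The check presupposes the hypotheses used in the proof, namely $I(m) \le m$ and $\lambda_1 \le 1/2$.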

Remark. The value of $T_0$ used in the proof is an approximation of the (smallest) root of the equation in the unknown $T$ obtained by equating the two bounds 3.24 and 3.25. The exact root would give a slightly better bound for $\lambda$, at the expense of more algebraic manipulations.

Remark. Although we have just proved that bound 3.22 holds for any $T$, the proof itself shows that, for $T > T_0$, the bound is weak, and that $AT \ge M$ provides more information on the area-time complexity of problem $\Pi$. However, when the computation is slow, the complexity is usually determined by other phenomena, such as the ones to be discussed in the next section.

Boundary Chips. We now briefly discuss the situation where all the I/O ports are on the boundary of the layout region, which for simplicity we consider to be a rectangle $R$ of dimensions $l_x$ and $l_y$. In this case, as we have mentioned in Section 3.1, the I/O bound requires $p = \Omega(M/T)$, where $M$ is the input size. Thus, for the larger side of the rectangle, say, the horizontal side of length $l_x$, we have $l_x = \Omega(M/T)$. On the other hand, we have also seen in Theorem 3.3 that both $l_x$ and $l_y$ are at least $\delta(m)$, and we know that $\delta(m) \ge I(m)/T$. Thus, $A = l_x l_y = \Omega((M/T)(I(m)/T))$. In conclusion, the performance of boundary chips satisfies the bound

    $AT^2 = \Omega(M I(m))$.    (3.27)

Remark. The value of $m$ that yields the best bound in 3.27 is not necessarily the same that would give the best bound in 3.22.

3.3  SATURATION  AREA-TIME  LOWER  BOUNDS 

When  we  ideally  isolate  a  region  of  the  layout  of  a  VLSI  system,  not  only  is  the  bandwidth 
between  this  region  and  the  remaining  part  of  the  layout  bounded  by  the  perimeter  of  the  region,  but 
also  the  amount  of  information  that  can  be  stored  within  the  region  is  bounded  by  its  area.  This  fact 
has  important  consequences  for  the  area-time  performance  of  some  computations.  In  this  section  we 
develop  techniques  to  express  these  effects  in  a  quantitative  manner. 

Information Exchange Under Bounded Storage. We consider again the by now familiar framework in which two processors $P_1$ and $P_2$ cooperate to solve a given problem $\Pi$. However, we add a new element to the picture by assuming that only a limited amount of storage is available in each processor.

Storage limitations may affect the information exchange. In fact, during the computation, one of the two processors may fill its storage (a situation referred to as "saturation") and hence be forced to send some information to its mate for temporary storage. At a later time, this information will return to the original processor, when its memory is no longer saturated. Each bit involved in this process goes back and forth, contributing twice to the information exchange.

These considerations lead to the following formal definition of the information exchange of $\Pi$ with respect to a given I/O assignment $\eta$ under the condition that $P_1$ and $P_2$ can store at most $s_1$ and $s_2$ bits of information respectively.

    $I(\eta \mid s_1, s_2) \triangleq$ the minimum over all algorithms (that solve $\Pi$ under assignment $\eta$, and under storage bounds $s_1$ and $s_2$ for $P_1$ and $P_2$) of the maximum over all the problem instances of the number of bits exchanged between $P_1$ and $P_2$.

Similarly, the information exchange for a class $H$ of assignments can be defined as

    $I_H(s_1, s_2) = \min_{\eta \in H} I(\eta \mid s_1, s_2)$.    (3.29)

Remark. The functions $I(\eta \mid s_1, s_2)$ and $I_H(s_1, s_2)$ are both nonincreasing in each of the two variables $s_1$ and $s_2$. (More storage never hurts.)

As in previous sections, of particular interest is the family of assignment classes defined with respect to a suitable set $\mathcal{U}$ of I/O variables of the problem, that is

    $H_m = \{\eta = (\mathcal{V}_1, \mathcal{V}_2) : |\mathcal{U} \cap \mathcal{V}_1| = m\}$.    (3.30)

For convenience of notation, we write $I(m \mid s_1, s_2)$ instead of $I_{H_m}(s_1, s_2)$.

The  Square-Tessellation  Technique.  We  now  show  that  by  combining  bounds  on  the  information 
exchange  with  bounded  storage  with  the  square-tessellation  technique  we  can  obtain  area-time  lower 
bounds. 

We  recall  from  Section  3.1  that  a  computation  graph  G  is  to  be  laid  out  on  the  layout  grid,  and 
that  it  is  also  useful  to  introduce  the  auxiliary  grid  whose  vertices  are  the  centers  of  the  elementary 
cells  of  the  layout  grid. 

Let us consider on the auxiliary grid a square cell with a side of length $l$ as shown in Figure 3.5. We can identify the part of the graph laid out within the cell with processor $P_1$, and the part laid out outside the cell with processor $P_2$. Obviously, the storage of $P_1$ is upper bounded by $l^2$. The storage of $P_2$ is also upper bounded by $A - l^2$, where $A$ is the area of the layout of the graph. However, in the sequel we will not make use of this bound.

Figure 3.5. A cell of the square tessellation can be viewed as a processor $P_1$, with storage bounded by $l^2$. Identifying $P_2$ with the rest of the layout, the bandwidth between $P_1$ and $P_2$ is bounded by $4l$.

If $m$ variables of $\mathcal{U}$ are input or output (by nodes of the computation graph laid out) within the cell, then the information exchange across the boundary of the cell is at least

    $I(m \mid l^2, A - l^2) \ge I(m \mid l^2, \infty)$.

Since the perimeter of the cell is $4l$, we can conclude that

    $T \ge I(m \mid l^2, \infty) / 4l$.    (3.31)

Given a tessellation of the layout region with square cells of side $l$, we can argue that, since $M = |\mathcal{U}|$ variables are input or output in area $A$, there exists at least a cell $C$ of the tessellation for which the number of variables of $\mathcal{U}$ handled by $C$ is

    $m \ge M l^2 / A$.    (3.32)

If we could find a cell of the tessellation for which $m$ is exactly $M l^2 / A$, we could state the bound

    $T \ge I(M l^2 / A \mid l^2, \infty) / 4l$

from Equation 3.31, which is indeed an area-time lower bound. A cell with $m$ exactly equal to $M l^2 / A$ does not always exist, but we can argue as follows to obtain a bound.
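
The pigeonhole step behind 3.32 can be illustrated with a small sketch. It is not part of the original argument: the function names are hypothetical, I/O ports are modelled as points in a rectangle, and the cell side is assumed to divide both sides of the rectangle so that the tessellation has exactly $A / l^2$ cells.

```python
def max_cell_load(points, side):
    """Largest number of points falling into one square cell of the given side."""
    counts = {}
    for x, y in points:
        cell = (int(x // side), int(y // side))
        counts[cell] = counts.get(cell, 0) + 1
    return max(counts.values())


def pigeonhole_threshold(num_points, side, area):
    """Lower bound M * l^2 / A on the load of the fullest cell (inequality 3.32)."""
    return num_points * side * side / area


# Example: 200 ports spread deterministically over a 16 x 16 layout, cells of side 4.
ports = [((7 * k) % 16 + 0.5, (11 * k) % 16 + 0.5) for k in range(200)]
fullest = max_cell_load(ports, side=4)
threshold = pigeonhole_threshold(len(ports), side=4, area=16 * 16)
```

With 16 cells and 200 ports the fullest cell must hold at least 12.5, hence at least 13, ports, whatever the placement.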

We know that, since at most $T$ variables can be input or output by a node in time $T$, there exists a suitable zig-zag line that cuts the cell $C$ into two regions one of which inputs and outputs a number $(M l^2 / A + h)$ of variables of $\mathcal{U}$, with $0 \le h < T$. Moreover, the perimeter of this region is at most $4l$ and the area is at most $l^2$. Thus, we can claim that an amount $I(M l^2 / A + h \mid l^2, \infty)$ of information must cross the boundary of this region. Hence,

    $T \ge I(M l^2 / A + h \mid l^2, \infty) / 4l$, for some $h \in [0, T - 1]$.    (3.33)

We  can  formally  summarize  the  preceding  discussion  by  stating  the  following  theorem. 


Theorem 3.7. Let $G$ be a computation graph for problem $\Pi$. Let $\mathcal{U}$ be a set of I/O (binary) variables of $\Pi$, of cardinality $M$. If $H_m$ is the class of assignments such that exactly $m$ variables of $\mathcal{U}$ are assigned to $P_1$, and $I(m \mid s, \infty)$ is the information exchange of $H_m$ when $P_1$ has $s$ bits of storage, then the area-time performance of any layout of $G$ satisfies the bound

    $T \ge \min_{0 \le h < T} I(M l^2 / A + h \mid l^2, \infty) / 4l$.    (3.34)

Proof. Obvious from Equation 3.33. □


Remark. To obtain the best possible bound from 3.34 we must choose the value of $l$ that maximizes the right-hand side. (We can choose $l$ as we wish, since the inequality holds for arbitrary $l$.)


Remark. In most cases in the range of interest, $I(m \mid s, \infty)$ is increasing with $m$, so that the minimum in 3.34 is achieved for $h = 0$. Then we can state the bound as

    $T \ge \max_{l} I(M l^2 / A \mid l^2, \infty) / 4l$.    (3.35)

In all applications of 3.33 made in what follows (Theorems 4.5, 4.10, and 4.18) it turns out that, for the value $l_0$ of $l$ that maximizes the lower bound, $I(m \mid l_0^2, \infty) = \beta_1 m$, where $\beta_1$ is a constant. Then we can write the bound as

    $AT \ge \beta M l_0$,    (3.36)

with $\beta = \beta_1 / 4$. Usually $l_0$ is an increasing function of the problem size, and therefore 3.36 is a better bound than the straightforward I/O bound $AT = \Omega(M)$.
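
For completeness, the step from 3.35 to 3.36 can be written out as follows. This is only the substitution $m = M l_0^2 / A$ under the stated assumption $I(m \mid l_0^2, \infty) = \beta_1 m$, in the notation of the text.

```latex
T \;\ge\; \frac{I\!\left(M l_0^2/A \,\middle|\, l_0^2, \infty\right)}{4 l_0}
  \;=\; \frac{\beta_1 \, M l_0^2 / A}{4 l_0}
  \;=\; \frac{\beta_1}{4} \cdot \frac{M l_0}{A}
\quad\Longrightarrow\quad
AT \;\ge\; \beta \, M l_0 , \qquad \beta = \beta_1 / 4 .
```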
CHAPTER 4


LOWER  BOUNDS 

FOR  CYCLIC  SHIFT  AND  SORTING 

4.1  CYCLIC  SHIFT 

Several  lower-bound  arguments  for  sorting  are  based  on  the  fact  that  sorting  circuits  are  capable 
of  performing  cyclic  shifts  on  a  suitable  sequence  of  words.  In  this  section  we  derive  some  results  on 
the  information  exchange  and  the  AT 2  measure  for  the  cyclic  shift  problem.  This  will  afford  us  the 
possibility  to  illustrate  the  lower  bounds  techniques  of  the  preceding  sections  by  applying  them  to  a 
relatively  simple  problem. 

Shift arguments were first proposed by [BK80] and [AA80] to lower bound the $AT^2$ performance of integer multipliers, and they have also been successfully applied to sorting by [L84], whose results will be reviewed in Section 4.2.2.

Definition. The input of the $(n,q)$-cyclic shift problem is a pair $(p, Z)$ where $p$ is an integer between 0 and $n-1$, and

    $Z = \{Z_i^j : i = 0, 1, \ldots, n-1;\ j = 0, 1, \ldots, q-1\}$

is an $n \times q$ array of $n$ words of $q$ symbols each. The output is an array $W$ with the same format as $Z$, such that

    $W_i^j = Z_{(i-p) \bmod n}^j$.

In the following we shall assume that $Z$ has binary entries. No assumption is made instead on the encoding used for the integer $p$, the size of the shift.
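
As a concrete reading of the definition, the following minimal sketch computes the output array for a given shift. The function name and the list-of-lists representation (one inner list per word) are assumptions made for illustration.

```python
def cyclic_shift(p, Z):
    """Return W with W[i][j] = Z[(i - p) mod n][j], where Z holds n words of q bits."""
    n = len(Z)
    if not 0 <= p < n:
        raise ValueError("shift size p must satisfy 0 <= p <= n - 1")
    # Each output word is a copy of the input word p places earlier (cyclically).
    return [list(Z[(i - p) % n]) for i in range(n)]
```

For example, with four 2-bit words and $p = 1$, the last input word becomes the first output word.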


Information Exchange. With reference to the framework of Section 3.2, let $\eta = (\mathcal{V}_1, \mathcal{V}_2)$ be an I/O assignment for the $(n,q)$-cyclic-shift problem. We need to define a number of quantities that are functions of $\eta$:

    $b_j$ = number of input words whose $j$-th bit is input by $P_1$,

    $c_j$ = number of output words whose $j$-th bit is output by $P_1$,

    $B = b_0 + b_1 + \cdots + b_{q-1}$ (global number of $Z$ entries input by $P_1$),

    $C = c_0 + c_1 + \cdots + c_{q-1}$ (global number of $W$ entries output by $P_1$).

For a given shift $p$, it is easy to see that input position $(i, j)$ contributes one bit to the information exchange if and only if $Z_i^j$ and $W_{(i+p) \bmod n}^j$ are assigned to different processors. An immediate consequence is that

    $I(\eta) \ge |B - C|$.    (4.1)

Let $\phi_j$ be the information exchange due to the $j$-th position summed over all the $n$ different shifts. We claim that

    $\phi_j = b_j (n - c_j) + (n - b_j) c_j$.    (4.2)

In fact, each of the $b_j$ bits input by $P_1$ is output by $P_2$ $(n - c_j)$ times, and, symmetrically, each of the $n - b_j$ bits input by $P_2$ is output by $P_1$ $c_j$ times.
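
The counting claim 4.2 can be verified by brute force for a single bit position. In the sketch below (names hypothetical), `in_owner[i]` and `out_owner[i]` give the processor, 1 or 2, that handles $Z_i^j$ and $W_i^j$ respectively.

```python
def crossings(p, in_owner, out_owner):
    """Bits of one position that cross between processors for shift size p."""
    n = len(in_owner)
    # Input bit i is delivered at output index (i + p) mod n.
    return sum(1 for i in range(n) if in_owner[i] != out_owner[(i + p) % n])


def phi_by_enumeration(in_owner, out_owner):
    """Sum of the crossings over all n shift sizes."""
    n = len(in_owner)
    return sum(crossings(p, in_owner, out_owner) for p in range(n))


def phi_closed_form(in_owner, out_owner):
    """Closed form 4.2: b (n - c) + (n - b) c."""
    n = len(in_owner)
    b = in_owner.count(1)
    c = out_owner.count(1)
    return b * (n - c) + (n - b) * c
```

Since the sum over the $n$ shifts is $\phi_j$, at least one shift has a number of crossings not smaller than $\phi_j / n$.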


By the pigeonhole principle there is a shift size with information exchange not smaller than the average, which ensures that

    $I(\eta) \ge \frac{1}{n} \sum_{j=0}^{q-1} \phi_j$.    (4.3)


We can then derive bounds on $I(\eta)$ if we are able to bound the $\phi_j$'s. With the motivation that $\phi_j$ tends to be large when the output bits of position $j$ are about equally split between $P_1$ and $P_2$, we classify the positions as follows. For given $\gamma \in [0, 1/2]$ we define

    $Q_0 = \{j : \gamma n \le c_j \le (1 - \gamma) n\}$,    (4.4)

    $Q_1 = \{j : c_j > (1 - \gamma) n\}$,

    $Q_2 = \{j : c_j < \gamma n\}$,

    $q_s = |Q_s|$, $s = 0, 1, 2$, $q_0 + q_1 + q_2 = q$,

    $B_s = \sum_{j \in Q_s} b_j$, $s = 0, 1, 2$.

We can think of positions in $Q_0$ as "balanced", and positions in $Q_s$ as $P_s$-biased, $s = 1, 2$. We show now that the information exchange of a given assignment is at least proportional to the number of balanced positions.

Theorem 4.1. The information exchange of any assignment $\eta$ for the $(n,q)$-cyclic-shift problem satisfies the inequality

    $I(\eta) \ge \gamma q_0 n$.    (4.7)

Proof. If $j \in Q_0$, then both $c_j$ and $n - c_j$ are $\ge \gamma n$, and

    $\phi_j = b_j (n - c_j) + (n - b_j) c_j \ge b_j \gamma n + (n - b_j) \gamma n = \gamma n^2$.    (4.8)

Combining the last inequality with 4.3 we obtain

    $I(\eta) \ge (1/n) \sum_{j \in Q_0} \phi_j \ge (1/n) q_0 \gamma n^2 = \gamma q_0 n$. □

Theorem 4.1 implies that $q_0 \le I / (\gamma n)$, and hence that $q_1 + q_2 \ge q - I / (\gamma n)$.

However, in some applications we need a bound on $q_1$ (or $q_2$) alone, which can be obtained in terms of the total number of inputs of $P_1$, as shown by the following theorem.

Theorem 4.2. The information exchange $I(\eta)$ of any assignment $\eta$ of the $(n,q)$-cyclic-shift problem, such that $P_1$ reads exactly $B$ entries of $Z$, satisfies the bound

    $I(\eta) \ge \gamma (B - B_1)$,    (4.9)

and, since $B_1 \le n q_1$,

    $q_1 \ge (B - I(\eta)/\gamma) / n$.    (4.10)

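Proof. For $j \in Q_0 \cup Q_2$, $n - c_j \ge \gamma n$. Thus

    $\phi_j \ge b_j (n - c_j) \ge b_j \gamma n$, $j \in Q_0 \cup Q_2$.

Then inequality 4.3 yields

    $I(\eta) \ge (1/n) \sum_{j=0}^{q-1} \phi_j \ge (1/n) \sum_{j \in Q_0 \cup Q_2} b_j \gamma n = \gamma (B_0 + B_2) = \gamma (B - B_1)$. □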
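
Theorems 4.1 and 4.2 can be sanity-checked on small assignments. The sketch below is illustrative only and uses hypothetical names. It does not compute $I(\eta)$ itself, which is a minimum over algorithms, but the worst-case number of crossing bits over all shifts, the quantity on which the averaging argument of 4.3 rests. `in_owner[j][i]` and `out_owner[j][i]` give the processor (1 or 2) handling $Z_i^j$ and $W_i^j$.

```python
from fractions import Fraction


def worst_shift_crossings(in_owner, out_owner):
    """Maximum over the shift sizes of the number of bits crossing between processors."""
    q, n = len(in_owner), len(in_owner[0])
    return max(
        sum(
            1
            for j in range(q)
            for i in range(n)
            if in_owner[j][i] != out_owner[j][(i + p) % n]
        )
        for p in range(n)
    )


def theorem_bounds(in_owner, out_owner, gamma):
    """Return (gamma * q0 * n, gamma * (B - B1)), the right-hand sides of 4.7 and 4.9."""
    q, n = len(in_owner), len(in_owner[0])
    b = [position.count(1) for position in in_owner]
    c = [position.count(1) for position in out_owner]
    balanced = [j for j in range(q) if gamma * n <= c[j] <= (1 - gamma) * n]
    p1_biased = [j for j in range(q) if c[j] > (1 - gamma) * n]
    total_inputs = sum(b)
    biased_inputs = sum(b[j] for j in p1_biased)
    return gamma * len(balanced) * n, gamma * (total_inputs - biased_inputs)
```

Exact rational arithmetic (`Fraction`) is used for $\gamma$ so that the set memberships are not affected by rounding.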

$AT^2$ Bounds. We begin with a simple but important result due to [BK80] and [AA80].

Theorem 4.3. For the $(n,1)$-cyclic-shift problem

    $AT^2 = \Omega(n^2)$.    (4.11)

Proof. From Equations 4.3 and 4.8, for $q = 1$, we obtain

    $I(\eta) \ge (b_0 (n - c_0) + (n - b_0) c_0) / n$.

If $H = \{\eta : b_0 = n/2\}$, then $I(\eta) \ge n/2$ for any $\eta \in H$. Then $I_H \ge n/2$, and the proof is completed by recalling from Theorem 3.6 that $AT^2 = \Omega(I_H^2)$. □

If we try to extend the previous result to words of arbitrary length $q$, we immediately realize that bipartition techniques are not sufficient. For example, we can construct a balanced assignment $\eta$ with $b_j = c_j = n$ for $j \le q/2 - 1$, and $b_j = c_j = 0$ for $j \ge q/2$. $I(\eta)$ is clearly zero (if we neglect the information exchange related to the shift size). The point is that each bit position can be processed independently from the others, so that information exchanges remain confined to small sets of I/O variables. This is a typical situation in which the square tessellation technique reveals its effectiveness.

Theorem 4.4. For the $(n,q)$-cyclic shift problem

    $AT^2 = \Omega(q n^2)$.    (4.12)

Proof. Let $\mathcal{U} = \{Z_i^j : i = 0, \ldots, n-1;\ j = 0, \ldots, q-1\}$, and $M = |\mathcal{U}| = nq$. Let $H = \{\eta : P_1$ reads exactly $n/2$ input variables$\}$. We recall from Theorem 4.2 that, for $\gamma \in [0, 1/2]$, $I(\eta) \ge \gamma (B - B_1)$, and we distinguish two cases:

(i) $Q_1$ is empty, so that, being $B_1 = 0$ and $B = n/2$, we obtain

    $I(\eta) \ge (\gamma / 2) n$.

(ii) $Q_1$ is not empty, so that at least for one $j$, $c_j > (1 - \gamma) n$. Thus $C \ge c_j > (1 - \gamma) n$, and recalling 4.1 we obtain

    $I(\eta) \ge C - B > (1 - \gamma) n - n/2 = (1/2 - \gamma) n$.

If we choose $\gamma = 1/3$, we see that in both cases $I(\eta) \ge n/6$. In conclusion $I_H \ge n/6$, and, from Theorem 3.6, with $M = nq$, $m = B = n/2$, and $I(m) \ge n/6$, we obtain

    $AT^2 = \Omega(q n^2)$. □

We will see next that, for suitably slow computations, the saturation technique allows us to derive a better bound than the one provided by Theorem 4.4.

Information Exchange Under Bounded Storage. We consider now the case when $P_1$ has a storage capacity of $s$ bits, and we show that, if $s < n$, the storage limitations really affect the information exchange. First, we need to prove a simple, but important lemma.

Lemma 4.1. In a cyclic shifter with a place-determinate I/O protocol all bits of $Z^j$ must be input before any bit of $W^j$ can be output.

Proof. For two arbitrary indices $i_1$ and $i_2$ ($0 \le i_1, i_2 \le n - 1$), we can find a suitable shift size such that $W_{i_2}^j = Z_{i_1}^j$. Thus, $W_{i_2}^j$ cannot be output before $Z_{i_1}^j$ is input. □

Remark. A simple consequence of this lemma is that, for every $j$ there is a time $t_j$ such that all bits of $Z^j$ have been input, and no bit of $W^j$ has been output. Then $n$ bits describing $Z^j$ are stored in the shifter at time $t_j$. It should be clear that these $n$ bits need not necessarily be $Z_0^j, \ldots, Z_{n-1}^j$ themselves, for the system is free to encode data arbitrarily for intermediate steps of the computation. However, since $Z_0^j, \ldots, Z_{n-1}^j$ are unrelated, any encoding of them requires at least $n$ bits for a suitable value of the variables, i.e. for a suitable problem instance.

Theorem 4.5. Any $(n,q)$-cyclic shifter satisfies the bound

    $AT = \Omega(q n \sqrt{n})$.    (4.13)

Proof. Let $\eta$ be an I/O assignment such that $P_1$ outputs $c_j$ bits of position $W^j$. If $c_j > s$, then at the instant $t_j$ (when all inputs of $Z^j$ have been input, but no output of $W^j$ has been released) at least $c_j - s$ of the bits that have to be output by $P_1$ are stored in $P_2$. Eventually, these bits have to be transferred to $P_1$ in order to be output, thus contributing an amount $(c_j - s)$ to the information exchange. If we let $(x)^+$ denote $x$ when $x > 0$, and zero otherwise, we can write

    $I(\eta \mid s, \infty) \ge \sum_{j=0}^{q-1} (c_j - s)^+$.    (4.14)

If we consider a value $s < (1 - \gamma) n$, for some $\gamma \in [0, 1/2]$, and we recall that $Q_1 = \{j : c_j > (1 - \gamma) n\}$, and that $q_1 = |Q_1|$, then Equation 4.14 easily yields

    $I(\eta \mid s, \infty) \ge \sum_{j \in Q_1} (c_j - s) \ge q_1 ((1 - \gamma) n - s)$.    (4.15)

From Theorem 4.2, we also know that $I(\eta)$, and a fortiori $I(\eta \mid s, \infty)$, satisfies the bound

    $I(\eta \mid s, \infty) \ge \gamma (B - n q_1)$,    (4.16)

where $B$ is the number of bits input by $P_1$ according to I/O assignment $\eta$. A linear combination of bounds 4.15 and 4.16 with coefficients $\gamma$ and $(1 - \gamma - s/n)$ respectively, yields

    $I(\eta \mid s, \infty) \ge \gamma \left(1 - \frac{\gamma}{1 - s/n}\right) B$,    (4.17)

where, as usual, $0 \le \gamma \le 1/2$ and $0 \le s/n \le 1 - \gamma$. The best bound is obtained when $\gamma = (1 - s/n)/2$ and is

    $I(\eta \mid s, \infty) \ge \frac{1}{4} (1 - s/n) B$.    (4.18)

Then, by applying Theorem 3.7 to the class of assignments $\eta$ such that $P_1$ inputs exactly $B$ bits, and recalling that in our case $M = nq$, we obtain

    $T \ge \min_{0 \le h < T} \frac{1}{4} (1 - l^2/n)(l^2 n q / A + h) / 4l$,

and thus

    $T \ge \frac{1}{16} (1 - l^2/n)\, l\, n q / A$.    (4.19)

The best bound is obtained for $l = \sqrt{n/3}$ and is

    $AT \ge \beta_1 q n \sqrt{n} = \Omega(q n \sqrt{n})$,    (4.20)

where $\beta_1 = 1/(24\sqrt{3})$. □
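
The choice $l = \sqrt{n/3}$ and the resulting constant can be checked numerically. The sketch below, with hypothetical function names, evaluates the factor $\frac{1}{16}(1 - l^2/n)\,l$ that multiplies $nq$ in 4.19 after both sides are multiplied by $A$.

```python
import math


def at_coefficient(l, n):
    """Factor (1/16) (1 - l^2 / n) l multiplying n*q in the bound AT >= ... of 4.19."""
    return (1 - l * l / n) * l / 16


def best_side(n):
    """Cell side maximizing at_coefficient: the stationary point where 1 - 3 l^2 / n = 0."""
    return math.sqrt(n / 3)


def beta1():
    """Constant of 4.20, so that at_coefficient(best_side(n), n) = beta1 * sqrt(n)."""
    return 1 / (24 * math.sqrt(3))
```

For instance, with $n = 300$ the optimal side is $l = 10$, and the coefficient equals $\beta_1 \sqrt{300}$.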

From Equations 4.20 and 4.12 we have seen that for any $(n,q)$-cyclic shifter, and constants $\beta_1$ and $\beta_2$, $AT \ge \beta_1 q n \sqrt{n}$, and $AT^2 \ge \beta_2 q n^2$. The latter bound is stronger for $T < (\beta_2/\beta_1)\sqrt{n}$, while the former is stronger for $T > (\beta_2/\beta_1)\sqrt{n}$. This fact indicates that the complexity of cyclic shift is dominated by pure information exchange for relatively fast computations, but is affected by storage limitations for slower computations.
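
The interplay of the two bounds can be made concrete by expressing both as lower bounds on the area for a given time $T$. In this sketch the function names are hypothetical and the constants `beta1` and `beta2` are free parameters; the text does not give a value for $\beta_2$.

```python
import math


def area_lower_bound(T, n, q, beta1, beta2):
    """Return (bound on A, name of the dominating bound) for an (n,q)-cyclic shifter."""
    from_at = beta1 * q * n * math.sqrt(n) / T       # from AT   >= beta1 q n sqrt(n)
    from_at2 = beta2 * q * n * n / (T * T)           # from AT^2 >= beta2 q n^2
    if from_at2 >= from_at:
        return from_at2, "AT2"
    return from_at, "AT"


def crossover_time(n, beta1, beta2):
    """Computation time at which the two bounds coincide."""
    return (beta2 / beta1) * math.sqrt(n)
```

With $n = 1024$ and equal constants, the crossover falls at $T = 32$: the $AT^2$ bound dominates below it and the $AT$ bound above it.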


4.2 SORTING

We  are  finally  ready  to  apply  the  general  techniques  described  in  the  preceding  sections  to  the 
derivation  of  area-time  lower  bounds  for  the  sorting  problem. 

For the purposes of this section we classify sorting problems according to the relationship between the length $k$ of the key and the number $n$ of keys. There are three cases that need to be analyzed separately, and for which we introduce the following terminology:

    short keys:          $1 \le k \le \log n$          (4.21a)

    medium-length keys:  $\log n < k < 2 \log n$       (4.21b)

    long keys:           $2 \log n \le k$.             (4.21c)
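
The classification 4.21 translates directly into a small helper. The function name is hypothetical, and the logarithm is assumed to be in base 2, which is consistent with binary keys and with $n$ being a power of 2.

```python
import math


def classify_keys(n, k):
    """Classify an (n, k)-sorting problem as 'short', 'medium' or 'long' per 4.21."""
    if n < 2 or k < 1:
        raise ValueError("expected n >= 2 keys of length k >= 1")
    log_n = math.log2(n)  # assumption: base-2 logarithm
    if k <= log_n:
        return "short"
    if k < 2 * log_n:
        return "medium"
    return "long"
```

For $n = 1024$ the thresholds are $k = 10$ and $k = 20$.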

Between  this  classification  of  sorting  problems,  and  the  classification  of  multisets  according  to  the  range 
of  optimality  of  different  encoding  schemes  there  is  an  intimate  relationship,  which  will  become  more 
and  more  apparent  as  we  proceed. 

In each of the three cases defined above we will derive several lower bounds using different techniques. The bounds will be of the $AT^2$ type when the dichotomy or the square tessellation techniques are used in combination with the unconstrained information exchange, and will be of the $AT$ type when the square tessellation technique is combined with the saturated information exchange.

The dichotomy technique gives satisfactory results only for keys of medium length, whereas the square tessellation technique yields the best bounds for short and long keys.

$AT^2$ and $AT$ bounds complement each other in the sense that the former are better for $T < T_0$, and the latter are better for $T > T_0$, where $T_0$ is a suitable computation time for which the two bounds coincide.

Notation. We recall from Chapter 1 that the input of the $(n,k)$-sorting problem can be viewed as an $n \times k$ array of binary variables

    $X = \{X_i^j : i = 0, 1, \ldots, n-1;\ j = k-1, k-2, \ldots, 0\}$,

where $X_i^j$ is the coefficient of $2^j$ in the binary representation of the $i$-th input key. The $i$-th row of $X$, $X_i$, represents the $i$-th input key, and the $j$-th column of $X$, $X^j$, represents the $j$-th least significant position. A similar notation is adopted for the output array $Y$.

Remark. Here and hereafter $n$ is generally assumed to be a power of 2. While this simplifies the treatment of several details, this assumption is not a serious restriction for asymptotic analysis. In fact the complexity of sorting $n$ keys ($n$ being here an arbitrary integer) is never smaller than the complexity of sorting $n_1$ keys, where $n_1$ is the largest power of two not exceeding $n$, and it is never larger than the complexity of sorting $n_2$ keys, where $n_2$ is the smallest power of two not smaller than $n$.
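
The two powers of two that bracket an arbitrary $n$ in the remark above can be computed as in this small sketch (the function name is hypothetical):

```python
def enclosing_powers_of_two(n):
    """Return (n1, n2): the largest power of two <= n and the smallest power of two >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    n1 = 1 << (n.bit_length() - 1)
    n2 = n1 if n1 == n else n1 << 1
    return n1, n2
```

Since $n_2 \le 2 n_1$, the two bracketing problem sizes differ by at most a factor of two.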

Finally,  we  recall  that  r  *  2*  is  the  cardinality  of  the  set  from  which  keys  to  be  sorted  are 
drawn. 

4.2.1  Short  Keys. 

Let $P_1$ and $P_2$ be two processors cooperating in solving an $(n,k)$-sorting problem, with
$k < \log n$. To give some intuition, let us consider the situation in which $P_1$ reads keys
$X_0,\ldots,X_{n/2-1}$, and $P_2$ reads keys $X_{n/2},\ldots,X_{n-1}$, and let us try to estimate the
information exchange $I$ necessary to complete the sorting. We can argue as follows. From the analysis
of the encoding of multisets developed in Section 2.2, we know that each processor can encode its input
using $\Theta(r\log(1+n/r))$ bits, and send the encoding to the other processor. Then each processor
obtains complete information about the input, and can compute all the outputs it is required to produce
without further communication with its mate. Thus, we can conclude that $I = O(r\log(1+n/r))$. It would
not be difficult to prove that in this situation the outlined algorithm minimizes the information
exchange, and that indeed $I = \Theta(r\log(1+n/r))$. However, the class $H$ of the assignments such
that $P_1$ reads exactly $n/2$ input keys does not have the properties to guarantee
$AT^2 = \Omega(I^2)$, because there might be no cut of the layout such that (nearly) half of the keys are
input on either side of the cut, unless we assume a word-local protocol (see Section 1.1). In the next
theorem we circumvent this difficulty by restricting our attention to the least significant bit position
of the input. This choice is not random, and is suggested by the fact that insert-and-prune encodings
allow the reconstruction of the entire input by looking only at the least significant bit position of
the output (see Theorem 2.1). The following result has been obtained independently by [Sg84a].

Theorem 4.6. Any VLSI $(n,k)$-sorter, with $k \le \log n$, satisfies the bound

$$AT^2 = \Omega\left(r^2\log^2(1+n/r)\right), \tag{4.22}$$

where $r = 2^k$.


Proof. With reference to the general framework of Section 3.2 let us consider the set
$\mathcal{V} = \{X_i^0 : i = 0,1,\ldots,n-1\}$ of the bits in the least significant position of the
input keys. Let $H$ be the class of I/O assignments such that exactly $n/2$ of the variables of
$\mathcal{V}$ are read by $P_1$, and let $I$ be the information exchange of $H$. Because of Theorem 3.6,
to prove Equation 4.22 it is enough to show that $I = \Omega(r\log(1+n/r))$.

Given $\eta \in H$, we can assume, without loss of generality, that the $n/2$ members of $\mathcal{V}$
input by $P_1$ belong to keys $X_0, X_1,\ldots,X_{n/2-1}$. We also divide both the input and the output
keys into $r/2$ segments of $2n/r$ consecutive words each (see Figure 4.1):

$$x\text{-seg}(h) = \{X_{h\cdot 2n/r+q} : q = 0,1,\ldots,2n/r-1\}, \tag{4.23}$$

$$y\text{-seg}(h) = \{Y_{h\cdot 2n/r+q} : q = 0,1,\ldots,2n/r-1\}, \tag{4.24}$$

for $h = 0,1,\ldots,r/2-1$. We say that $y$-seg$(h)$ is $P_s$-biased ($s = 1,2$) if at least half of
the l.s.b. of keys in the segment are output by $P_s$. There is one processor, say $P_2$, such that
there are at least $r/4$ indices $h_0, h_1,\ldots,h_{r/4-1}$ for which $y$-seg$(h)$ is $P_2$-biased.
Let $h_{r/4}, h_{r/4+1},\ldots,h_{r/2-1}$ be the remaining indices.

We now construct a subproblem of sorting by setting all the bits of each input key, except the
least significant one, to a constant value, such that

$$X_{p\cdot 2n/r+q} = 2h_p + X^0_{p\cdot 2n/r+q}, \qquad p = 0,1,\ldots,r/2-1,\quad q = 0,1,\ldots,2n/r-1,$$

where $X^0_{p\cdot 2n/r+q}$ is arbitrary. In the corresponding output, $y$-seg$(h_p)$ contains the
sorted sequence of $x$-seg$(p)$, with the $k-1$ leading bits of each key representing $h_p$. The l.s.
bits $Y^0_{h_p\cdot 2n/r},\ldots,Y^0_{h_p\cdot 2n/r+2n/r-1}$ form a string of zeros followed by a string
of consecutive ones. The number of zeros $z_p$ obviously equals the number of variables
$X^0_{p\cdot 2n/r},\ldots,X^0_{p\cdot 2n/r+2n/r-1}$ which are zero. Thus, there are $2n/r+1$ possible
outcomes for each $y$-seg. The situation is illustrated in Figure 4.1.
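The construction above can be checked on small instances. The following Python sketch is only an illustration under stated assumptions: the function name `check_segment_outputs`, the representation of keys as integers, and the use of Python's built-in `sorted` as a stand-in for the sorter are all hypothetical choices, and the map $p \mapsto h_p$ is passed in as a list `h` that must be a permutation of `range(r // 2)`.

```python
import random


def check_segment_outputs(n, k, h, seed=0):
    """Build the subproblem X_{p*seg+q} = 2*h[p] + lsb and verify the output segments.

    n: number of keys (power of two), k: key length with r = 2**k <= n,
    h: permutation of range(r // 2); x-seg(p) is routed to y-seg(h[p]).
    """
    r = 2 ** k
    seg = 2 * n // r                       # keys per segment
    rng = random.Random(seed)
    # Free variables: the least significant bit of every input key.
    lsb = [rng.randint(0, 1) for _ in range(n)]
    x = [2 * h[p] + lsb[p * seg + q] for p in range(r // 2) for q in range(seg)]
    y = sorted(x)                          # stand-in for the VLSI sorter
    for p in range(r // 2):
        zeros_in = seg - sum(lsb[p * seg:(p + 1) * seg])
        out_bits = [v & 1 for v in y[h[p] * seg:(h[p] + 1) * seg]]
        # y-seg(h[p]) reads 0...01...1 with exactly zeros_in zeros.
        if out_bits != [0] * zeros_in + [1] * (seg - zeros_in):
            return False
    return True
```

For example, `check_segment_outputs(16, 3, [2, 0, 3, 1])` exercises the case $r = 8$ with four segments of four keys each.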

Let us now focus on values of $h_p$ with $p < r/4$. All the l.s. bits of $x$-seg$(p)$ are input by
$P_1$, and at least $n/r$ l.s.b. of $y$-seg$(h_p)$ are output by $P_2$. Thus, $P_2$ is capable of
producing at least $n/r+1$ different outcomes, and therefore of distinguishing among $n/r+1$ intervals
in which $z_p$ can fall. This is possible only if $\log(n/r+1)$ bits relative to $x$-seg$(p)$ are
communicated to $P_2$ by $P_1$. This being true for $r/4$ unrelated segments, we conclude, as desired,
that

$$I \ge \frac{r}{4}\log(n/r+1) = \Omega(r\log(n/r+1)). \tag{4.25}$$

□

Figure 4.1. Input and output arrays in the proof of Theorem 4.6.

If we combine the bound in Equation 4.25 with the bound in Equation 1.27 on the performance
of boundary chips (in our case $M = nk$) we obtain the following result.

Theorem 4.7. Any VLSI $(n,k)$-sorter, with $k \le \log n$, and with all its I/O ports on the boundary,
satisfies the bound

$$AT^2 = \Omega\left(nkr\log(1+n/r)\right), \tag{4.26}$$

where $r = 2^k$.

We have seen that for a protocol assigning half of the input keys to each processor the information
exchange is $I = \Theta(r\log(1+n/r))$, but the situation can be dramatically different for other
protocols, as illustrated by the next theorem.

Theorem 4.8. Given an $(n,k)$-sorting problem, with $k < \log n$, let $H$ be the class of the I/O
assignments such that $P_1$ inputs bit positions $X^0, X^1,\ldots,X^{k/2-1}$ and $P_2$ inputs bit
positions $X^{k/2}, X^{k/2+1},\ldots,X^{k-1}$ ($k$ is even for simplicity). Then the information
exchange of $H$ is

$$I = \Omega(kn). \tag{4.27}$$


Proof. We plan to transform the string equality problem to our sorting problem. In the string
equality problem there are two input strings, $W_1$ and $W_2$, of length $h$ each, and there is one
output bit which is 1 if and only if $W_1 = W_2$. It is easy to show that if processor $P_s$ inputs
string $W_s$, $s = 1,2$, then the solution of string equality requires an information exchange
$I \ge h$ (see, for example, [Y79] and
[Y1SS2J). 

To carry out the transformation we set the first $r$ input keys ($r < n$) to the constant value
$X_i = i$ ($i = 0,1,\ldots,r-1$). From Theorem 2.1 we know that output position $Y^0$ is sufficient to
reconstruct the entire input multiset.

Let us now define the strings $W_1$ and $W_2$. $W_1$ is the row-major spelling of the array

$$\{X_i^j : i = r,\ldots,n-1;\ j = k-1,\ldots,k/2\}.$$

$W_2$ is the row-major spelling of the array

$$\{X_i^j : i = r,\ldots,n-1;\ j = k/2-1,\ldots,0\}.$$

(Refer to Figure 4.2.) It is clear that $W_1 = W_2$ if and only if $X_i^j = X_i^{j+k/2}$ for
$i = r,\ldots,n-1$, and $j = 0,\ldots,k/2-1$. This is equivalent to saying that each key $X_i$
($i \ge r$) in the input multiset is the concatenation of two identical strings, a property of the
multiset which is independent of its representation, and can therefore be verified once $Y^0$ is known.
Now let $P_s$ be the processor that outputs more variables of $Y^0$ (break a tie arbitrarily). Let
$P_s$ and $P_{3-s}$ sort the input by exchanging $I'$ bits, and let $P_s$ receive $I'' \le n/2$ further
bits describing the components of $Y^0$ that $P_{3-s}$ is required to output. With no further
communication, $P_s$ is now able to decide equality of $W_1$ and $W_2$, whose length is
$h = (n-r)k/2$. Then $I' + I'' \ge (n-r)k/2$, and $I' \ge (n-r)k/2 - n/2 = \Omega(kn)$. □

Figure 4.2. Configuration of the input array $X$ in the proof of Theorem 4.8.

The bound $AT^2 = \Omega(r^2\log^2(1+n/r))$ obtained by bipartition techniques is weaker than the I/O
bound $AT = \Omega(kn)$ for a wide range of values of $r$ and $T$. However, we can greatly improve the
$AT^2$ lower bound by the square tessellation method.

Theorem 4.9. Any VLSI $(n,k)$-sorter, with $k < \log n$, satisfies the bound

$$AT^2 = \Omega(nr), \tag{4.28}$$

where $r = 2^k$.

Proof. We plan to show that an I/O assignment in which $P_1$ reads exactly $r/2$ bits of the least
significant input position requires an information exchange $\Omega(r)$. Equation 4.28 will then follow
from Theorem 3.6 with $\mathcal{V} = \{X_i^0 : i = 0,\ldots,n-1\}$, $M = |\mathcal{V}| = n$, $m = r/2$,
and $I = \Omega(r)$.


We begin by showing that, chosen an arbitrary set of input bits

$$\mathcal{U}_{in} = \{X^0_{q_i} : i = 0,1,\ldots,r/2-1\} \tag{4.29}$$

and an arbitrary set of output bits

$$\mathcal{U}_{out} = \{Y^0_{t_i} : i = 0,1,\ldots,r/2-1\} \tag{4.30}$$

with $t_i < t_{i+1}$, the remaining input bits can be set to constant values to enforce the condition

$$Y^0_{t_i} = X^0_{q_i}, \qquad i = 0,1,\ldots,r/2-1. \tag{4.31}$$

More specifically we set

$$X_{q_i} = 2i + X^0_{q_i}, \qquad i = 0,1,\ldots,r/2-1,$$

and we divide the remaining $n - r/2$ input keys arbitrarily into $r/2+1$ sets such that, for
$i = 0,1,\ldots,r/2-1$, the $i$-th set contains $(t_i - t_{i-1} - 1)$ keys whose value is set to $2i$
(with $t_{-1} = -1$, output indices being counted from 0), and the $(r/2)$-th set contains
$(n - 1 - t_{r/2-1})$ keys whose value is set to $r-1$. The output sequence corresponding to this input
is shown in Figure 4.3, and it satisfies Equation 4.31.
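The routing property claimed in Equation 4.31 can be verified exhaustively on small cases. The sketch below is an illustration, not the author's procedure: the function name `route_lsb` and its argument names are hypothetical, indices are 0-based, the convention $t_{-1} = -1$ is assumed, and Python's `sorted` stands in for the sorter.

```python
def route_lsb(n, r, in_idx, out_idx, bits):
    """Return the l.s.b. column Y^0 of the sorted output for the constructed input.

    in_idx:  r/2 distinct input key indices (the chosen input bits),
    out_idx: r/2 strictly increasing output indices t_0 < t_1 < ...,
    bits:    the free values of the chosen input bits.
    """
    half = r // 2
    assert len(in_idx) == len(out_idx) == len(bits) == half
    x = [None] * n
    for i, (q, b) in enumerate(zip(in_idx, bits)):
        x[q] = 2 * i + b                   # chosen key i carries value 2i + bit
    filler, prev = [], -1                  # assumes t_{-1} = -1
    for i, t in enumerate(out_idx):
        filler += [2 * i] * (t - prev - 1) # set i: (t_i - t_{i-1} - 1) keys of value 2i
        prev = t
    filler += [r - 1] * (n - 1 - prev)     # last set: keys of value r - 1
    free = [j for j in range(n) if x[j] is None]
    assert len(free) == len(filler)        # exactly n - r/2 remaining keys
    for j, v in zip(free, filler):
        x[j] = v
    return [v & 1 for v in sorted(x)]
```

With this input, the bit found at output index `out_idx[i]` equals `bits[i]` for every choice of `bits`, which is the identity function between the two sets that the proof relies on.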

Now, let us consider a protocol that assigns exactly $r/2$ variables of $\mathcal{V}$ to $P_1$ and
$n - r/2$ ($\ge r/2$) to $P_2$. Let $P_s$ be the processor that outputs more entries of $Y^0$ (break a
tie arbitrarily). We can always find two sets $\mathcal{U}_{in}$ and $\mathcal{U}_{out}$ as in 4.29 and
4.30 such that $\mathcal{U}_{in}$ is input by $P_{3-s}$ and $\mathcal{U}_{out}$ is output by $P_s$.
Equation 4.31 implies that $r/2$ bits input by $P_{3-s}$ are output by $P_s$, for a suitable value of
input variables not in $\mathcal{U}_{in}$. Hence, $I \ge r/2 = \Omega(r)$, as desired. □

We shall now prove an $AT$ lower bound on the performance of an $(n,k)$-sorter for $k < \log n$. The
proof is based on information exchange under bounded storage (saturation). However, the technique of
Section 1.3 will be applied not to the entire computation interval $[0,T]$, but just to suitably defined
subintervals.

Theorem 4.10. Any VLSI $(n,k)$-sorter, with $k \le \log n$, satisfies the bound

$$AT = \Omega(n\sqrt{r}), \tag{4.32}$$

where $r = 2^k$.

Proof. For some real $\sigma \in [0,1/2]$, in any tessellation of the layout with square cells of area
$\sigma r$, there is at least one cell $C$ that outputs $m \ge n\sigma r/A$ bits belonging to $Y^0$, the
least significant output position. Based on the output schedule of cell $C$, we partition the interval
$[0,T]$ into consecutive intervals $[t_i+1, t_{i+1}]$ (where $i = 0,1,\ldots,L-1$, with $t_0 = -1$ and
$t_L = T$), in such a way that, in each interval, $C$ outputs between $r/2$ and $r(1/2+\sigma)$ bits of
$Y^0$. We can always find such a partition, since the cell can output at most $\sigma r$ bits at any
given time. Furthermore, since cell $C$ outputs $m \ge n\sigma r/A$ bits of $Y^0$, the number of
intervals is at least $L \ge n\sigma/((1/2+\sigma)A)$.


Figure 4.3. Sending $r/2$ arbitrary l.s. input bits to $r/2$ arbitrary l.s. output bits.


We now establish a lower bound on the duration $t_{i+1} - t_i$ of the $i$-th interval. As we have seen
in Theorem 4.9, given an arbitrary sequence of $r/2$ components of $Y^0$, and an arbitrary sequence of
$r/2$ components of $X^0$, it is possible to select the remaining inputs of the sorter in order to
realize the identity function between the two sequences. Let us choose the $r/2$ components of $Y^0$
among those output by cell $C$ in $[t_i+1, t_{i+1}]$, and the $r/2$ components of $X^0$ in any arbitrary
way. Then, since $X^0$ must be completely input before any bit of $Y^0$ can be output, during the
interval $[t_i+1, t_{i+1}]$ cell $C$ outputs $r/2$ bits that are already in the system at the beginning
of the interval. Since $C$ could store at most $\sigma r$ of them, the remaining $(1/2-\sigma)r$ must
flow across the boundary of the cell, whose length is $4\sqrt{\sigma r}$, during the interval. Hence,

$$t_{i+1} - t_i \ge \frac{(1/2-\sigma)r}{4\sqrt{\sigma r}} = \frac{\sqrt{r}\,(1/2-\sigma)}{4\sqrt{\sigma}}, \tag{4.33}$$

and, since $t_0 = -1$ and $t_L = T$,

$$T + 1 = \sum_{i=0}^{L-1}(t_{i+1} - t_i) \ge L\,\frac{\sqrt{r}\,(1/2-\sigma)}{4\sqrt{\sigma}}. \tag{4.34}$$


Recalling  the  bound  on  L  we  obtain 


4.35 


which completes our proof. (Inequality 4.35 yields the best bound for $\sigma = (\sqrt{57}-7)/4 \approx 0.138$.) □


From Theorem 4.10 and Theorem 4.9 we know that there exist constants $d_1$ and $d_2$ such that the
performance of any $(n,k)$-sorter, with $k \le \log n$, satisfies the bounds $AT \ge d_1 n\sqrt{r}$, and
$AT^2 \ge d_2 nr$. These bounds coincide at time $T_0 = (d_2/d_1)\sqrt{r}$. The $AT$ bound is stronger
for $T > T_0$, and the $AT^2$ bound is stronger for $T < T_0$.
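To make the crossover concrete, the two bounds can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and the constants `d1` and `d2` default to 1 purely as placeholders, since the text only asserts that such constants exist.

```python
import math


def area_lower_bound(n, r, T, d1=1.0, d2=1.0):
    """Larger of the area bounds implied by AT >= d1*n*sqrt(r) and AT^2 >= d2*n*r."""
    from_at = d1 * n * math.sqrt(r) / T
    from_at2 = d2 * n * r / T ** 2
    return max(from_at, from_at2)


def crossover_time(r, d1=1.0, d2=1.0):
    """Time T0 = (d2/d1)*sqrt(r) at which the two bounds coincide."""
    return (d2 / d1) * math.sqrt(r)
```

Dividing $AT^2 \ge d_2 nr$ by $AT \ge d_1 n\sqrt{r}$ at equality gives $T_0$ directly; below $T_0$ the $1/T^2$ term is the larger one, above it the $1/T$ term is.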


The  next  two  theorems  provide  us  with  some  more  information  on  the  feasibility  region  of  the 
sorting  problem,  for  short  keys. 


The  first  theorem  gives  a  lower  bound  on  the  area,  regardless  of  the  computation  time.  The  same 
result  has  been  independently  obtained  in  [Sg84a]  with  a  different  proof. 


The  second  theorem  gives  a  lower  bound  on  computation  time,  regardless  of  the  area. 


Theorem 4.11. The area of any VLSI $(n,k)$-sorter, with $k < \log n$, satisfies the bound

$$A = \Omega(r\log(1+n/r)). \tag{4.36}$$

Proof. As we have already seen, due to the functional dependence of the variables in $Y^0$ upon the
variables in $X^0$, and to the time-determinate property of the I/O protocol, there is a time $t^0$
such that all the components of $X^0$ are input not later than $t^0$, and all the components of $Y^0$
are output after $t^0$. We have also seen that if we set the first $r$ input keys to the constant value
$X_i = i$, $i = 0,\ldots,r-1$, the remaining part of the input multiset can be uniquely reconstructed
from $Y^0$.

Thus, a representation of $\{X_r,\ldots,X_{n-1}\}$ is essentially stored in the system at time $t^0$,
and we know (see Eq. 2.13) that $\Omega(r\log(1+n/r))$ bits are necessary to encode this multiset. □

Theorem 4.12. The computation time of any VLSI $(n,k)$-sorter, with $k < \log n$, satisfies the bound

$$T = \Omega(\log n). \tag{4.37}$$

Proof. Equation 4.37 follows from the assumption of bounded fan-in when considering that the
components of $Y^0$ depend on all the $nk$ input variables. □

4.2.2  Medium-Length  Keys 

In this section we turn our attention to the $(n, \log n + h)$-sorting problem, and we derive bounds
for $0 < h < \log n$.

A simple observation, which is useful for lower bound arguments, is that by setting the $\log n$
leading bits of the input keys to an appropriate value, we can force the output sequence to be an
arbitrary permutation of the input sequence. In particular the $h$ least significant bits can be chosen
arbitrarily to create information flow.

This observation was originally exploited by Thompson [T80] to show that, for word-local protocols
and $k = \log n + \Theta(\log n)$, $AT^2 = \Omega(n^2\log^2 n)$. A straightforward generalization of
Thompson's argument allows us to prove the following theorem.

Theorem 4.13. Any VLSI $(n, \log n + h)$-sorter, with $h > 0$, and with word-local protocol, satisfies
the bound

$$AT^2 = \Omega(n^2h^2). \tag{4.38}$$

Proof. We will prove the theorem by showing that the class of I/O assignments
$H = \{\eta : P_1 \text{ inputs exactly } n/2 \text{ keys}\}$ has information exchange $I = \Omega(nh)$.

Without loss of generality we can assume that $P_1$ inputs keys $X_0,\ldots,X_{n/2-1}$ and that $P_2$
outputs at least $n/2$ keys, say $Y_{a_0}, Y_{a_1},\ldots,Y_{a_{n/2-1}}$. Let
$Y_{a_{n/2}},\ldots,Y_{a_{n-1}}$ be the remaining keys. By setting the $\log n$ leading bits of $X_i$
to the binary representation of integer $a_i$, we ensure that $Y_{a_i} = X_i$, $i = 0,\ldots,n-1$.
Thus, the $h$ least significant bits of each key input by $P_1$ are output by $P_2$, and
$I \ge nh/2 = \Omega(nh)$, as claimed. □
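The key step of this proof, forcing an arbitrary permutation through the choice of the leading bits, is easy to check directly. The following sketch is an illustration under assumptions: the function name `force_permutation` is hypothetical, keys are plain integers, and `sorted` stands in for the sorter.

```python
def force_permutation(a, low_bits, h):
    """Build keys whose log n leading bits spell a[i] and whose h low bits are arbitrary.

    a:        permutation of range(n); a[i] is the output index intended for key i,
    low_bits: arbitrary h-bit payloads, one per key.
    Returns the input keys x and the sorted output y, for which y[a[i]] == x[i].
    """
    x = [(a[i] << h) | low_bits[i] for i in range(len(a))]
    y = sorted(x)                          # stand-in for the sorter
    return x, y
```

Because the leading fields are all distinct, the payloads play no role in the ordering, so each $h$-bit payload travels unchanged from input index $i$ to output index $a_i$.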

No better bounds could be obtained on $I$ under the word-local protocol, since $O(nk)$ bits are
sufficient to encode the entire input. The important question instead is the removal of the "word-local"
restriction.

Some preliminary considerations and an example will help us put in the proper perspective the
nature of the problems arising when dealing with arbitrary protocols.

The output of the sorter is a permutation of the input, so that

$$Y_i = X_{\pi(i)}, \qquad i = 0,1,\ldots,n-1, \tag{4.39}$$

where $\pi(0),\pi(1),\ldots,\pi(n-1)$ is a permutation of $0,1,\ldots,n-1$. Focussing on the bit
position of index $j$ of the data we have

$$Y_i^j = X_{\pi(i)}^j, \qquad i = 0,1,\ldots,n-1. \tag{4.40}$$

Thus, there is an information flow from the input to the output ports of the same position, which we
call primary flow. The primary flow of each position is, in a way, self-contained, because each bit
involved enters the system and leaves the system maintaining its identity. However, the exact
destination of each bit within its own position depends on $\pi$, which, for position $j$, is determined
by the value of the data in positions $j, j+1,\ldots,k-1$. Thus, there is another kind of information
flowing from most significant to least significant positions, which we call secondary flow.

As we can see from the proof of Theorem 4.13, the complexity of word-local sorting is based
exclusively on primary flow. Let us now consider an example of protocol which requires exclusively
secondary flow.

Example. We want to estimate the information exchange of the protocol that assigns the leading
positions $X^j$, $Y^j$, $j = k/2,\ldots,k-1$ to $P_1$, and the least significant positions $X^j$,
$Y^j$, $j = 0,\ldots,k/2-1$ to $P_2$. Since each bit position is completely input and output by the same
processor, there is no primary flow. However, $P_2$ needs information on the relative order of the most
significant part of the keys in order to know which permutation to apply to the least significant ones.
Thus, we are in the presence of secondary flow alone.

It is clear that, no matter how large $k$ is, $I \le n\log n$. In fact the leading bits of $P_1$ can be
sorted ignoring the bits of $P_2$, so that no information transfer is needed from $P_2$ to $P_1$. On
the other hand, all that $P_2$ needs to know about the portion of keys dealt with by $P_1$ is the
relative order, which can be encoded in no more than $n\log n$ bits.

We can also show that, for $k = 2\log n$, $I \ge \log(n!) = (n\log n - \text{lower order terms})$. In
fact, let us consider the class of instances of the problem such that $X_i = 2^{k/2}\pi(i) + i$
($i = 0,\ldots,n-1$), where $\pi(0),\ldots,\pi(n-1)$ is a permutation of $0,\ldots,n-1$. The
corresponding output is $Y_i = 2^{k/2}i + \pi^{-1}(i)$ ($i = 0,\ldots,n-1$). Thus, at least $\log n!$
bits (to describe $\pi$) are sent from $P_1$ to $P_2$. In conclusion, for $k = 2\log n$,
$I = n\log n - (\text{lower order terms})$.
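The instance family used in this argument can be reproduced in a few lines. This is a sketch under assumptions: the function name `secondary_flow_instance` is hypothetical, $n$ is taken to be a power of two with $k = 2\log n$, and `sorted` stands in for the sorter.

```python
def secondary_flow_instance(pi):
    """Build X_i = 2^(k/2) * pi(i) + i for k = 2 log n and return (x, high, low).

    pi: permutation of range(n), n a power of two (n >= 2).
    high/low: leading and least significant halves of the sorted keys.
    """
    n = len(pi)
    half = (n - 1).bit_length()            # equals log2(n) when n is a power of two
    x = [(pi[i] << half) | i for i in range(n)]
    y = sorted(x)                          # stand-in for the sorter
    high = [v >> half for v in y]
    low = [v & ((1 << half) - 1) for v in y]
    return x, high, low
```

The leading halves of the output are always $0,1,\ldots,n-1$, while the low halves spell out $\pi^{-1}$; the processor handling the low positions can therefore recover $\pi$, which is why at least $\log n!$ bits must reach it.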

A more detailed analysis would show that, for $1 \le k \le 2\log n$, $I \approx nk/2$, and that for
$k > 2\log n$, $I \approx n\log n$, regardless of $k$. The fact that secondary flow never exceeds
$n\log n$ has important consequences, as we shall soon see. □

When analyzing arbitrary protocols, primary flow and secondary flow must be considered
simultaneously. In fact, in particular situations one of the two may be negligible but, as we shall see,
they cannot be simultaneously small.

Leighton [L84] has shown how to combine primary and secondary flow bounds with the help of
cyclic shift arguments. His result was stated in the form $AT^2 = \Omega(n^2\log^2 n)$ for
$k \ge 7\log n$.


Exploiting similar ideas, although with a rather more elaborate construction, we will show that
$AT^2 = \Omega(n^2h^2)$ for $0 < h = k - \log n < \log n$. An obvious consequence is that for
$k = (1+\alpha)\log n$ ($\alpha > 0$), $AT^2 = \Omega(n^2\log^2 n)$. However, some discussion is in
order, to clarify a subtle point. The $\Omega$ notation is misleading here (at least it has misled us
for some time), and it is better to rewrite the bounds in the following form: for
$k = (1+\alpha)\log n$ ($\alpha > 0$), $AT^2 \ge c(\alpha)\,n^2\log^2 n$, where $c(\alpha)$ depends on
$\alpha$, but is independent of $n$. The crucial point we want to address is that, for reasons that the
next theorems on lower bounds will clarify and that are essentially related to the saturation behavior
of secondary flow, the dependence of $c(\alpha)$ on $\alpha$ is quadratic for $\alpha \le 1$, but is
linear for $\alpha \ge 1$. This fact, together with Leighton's observation that, when $k \gg \log n$,
one can construct VLSI sorters whose complexity is subquadratic in $k$ [L84], shows that it would not be
appropriate to consider a problem with $k = 1.1\log n$ and a problem with $k = 100\log n$ in the same
class, although, superficially, we can say that $AT^2 = \Omega(n^2\log^2 n)$ in both cases.

We are then motivated to distinguish between medium-length and long keys. Obviously the
choice of $k = 2\log n$ as separation of the two classes is rather conventional, but it will serve our
purpose. With this premise, we shall now prove the $AT^2$ bound for medium-length keys.

Theorem 4.14. Any VLSI $(n, \log n + h)$-sorter, with $0 < h < \log n$, satisfies the bound

$$AT^2 = \Omega(n^2h^2). \tag{4.41}$$

Proof. We begin by partitioning the input array as $X = [D, E, F]$, where $D$, $E$, $F$ are blocks of
$d$, $\log n - d$, and $h$ consecutive columns respectively. The partition of the generic key $X_i$ is
shown in Figure 4.4.

We shall prove Eq. 4.41 by showing that $I = \Omega(nh)$ for the class of input assignments such
that $P_1$ and $P_2$ input each exactly $nd/2$ of the entries of $D$. Below we shall derive two lower
bounds on $I$, and we will see that at least one of these bounds is not smaller than $nh/12$.

Adopting the same notation as in Section 4.1 we let $c_j$ be the number of components of $Y^j$ output
by $P_1$. We also let, for $\gamma \in (0,1/2]$,
$Q_0 = \{j : j < h,\ \gamma n \le c_j \le (1-\gamma)n\}$,
$Q_1 = \{j : j < h,\ c_j > (1-\gamma)n\}$, and $Q_2 = \{j : j < h,\ c_j < \gamma n\}$. Finally we
denote by $q_i$ the cardinality of $Q_i$, $i = 0,1,2$.

Figure 4.4. Partition of key $X_i$, for the proof of Theorem 4.14.

Primary Flow Bound. By suitably choosing the entries of $D$ and $E$, we can produce any of the $n$
cyclic shifts of array $F$. Then we can use the notation of Section 4.1 and Theorem 4.1, with $Z = F$
and $q = h$, to obtain the bound

$$I \ge \gamma\, q_0\, n, \tag{4.42}$$

which is valid for any $\gamma \in [0,1/2]$, and where $q_0$ is a function of $\gamma$.

Secondary Flow Bound. Without loss of generality let $q_1 \ge q_2$, so that $q_1 \ge (h - q_0)/2$. Let
also $Q$ be a subset of $Q_1$ consisting of $d \le q_1$ bit positions (the positions of $Q$ are not
necessarily consecutive, but they will be thought of as ordered from most to least significant as they
are in $X$). We then consider the following class of input instances, where $l = 2^d$ and $a = n/l$.
Let input array $X$ be partitioned into $a$ blocks each of $l$ consecutive rows, and let the entries of
the $(i+1)$-st block ($i = 0,\ldots,a-1$) be set so that (see also Figure 4.5a):

(1) The rows of $D$ have values $\pi_i(0),\ldots,\pi_i(l-1)$, where $\pi_i$ is a permutation of
$0,\ldots,l-1$.

(2) The rows of $E$ are all identical and equal to $i$.

(3) The rows of $Q$ have values $0,\ldots,l-1$.

(4) The rows of $F - Q$ are set to zero.

Figure 4.5. Configuration of inputs and outputs in the proof of Theorem 4.14. In the actual arrays
$X$ and $Y$, the columns of blocks $Q$ and $F - Q$ are mixed, but the left-to-right order of the
columns in each block is maintained.

It can be easily shown that if we partition the output array $Y$ into $l$ blocks each of $a$
consecutive rows, the $(s+1)$-st block from the top ($s = 0,\ldots,l-1$) has the following structure
(see also Figure 4.5b):

(1) The rows of $D$ have value $s$.

(2) The rows of $E$ have values $0,\ldots,a-1$.

(3) The rows of $Q$ have values $\pi_0^{-1}(s), \pi_1^{-1}(s),\ldots,\pi_{a-1}^{-1}(s)$.

(4) The rows of $F - Q$ have value zero.

Thus, permutations $\pi_0, \pi_1,\ldots,\pi_{a-1}$ can be uniquely reconstructed from outputs in $Q$,
so that $Q$ carries $a(l\log l - \text{lower order terms})$ bits of information relative to section $D$
of the input. Since $P_1$ inputs only $nd/2$ of the bits describing $\pi_0,\ldots,\pi_{a-1}$, and
outputs at least $(1-\gamma)nd$ of the bits of $Q$ from which $\pi_0,\ldots,\pi_{a-1}$ can be
recovered, we conclude that at least $(1-\gamma)nd - nd/2$ bits are transferred from $P_2$ to $P_1$. If
we choose $d = q_1$, we obtain

$$I \ge (1/2-\gamma)\,q_1 n \ge (1/2-\gamma)\,[(h-q_0)/2]\,n, \tag{4.43}$$

where again $\gamma \in [0,1/2]$ and $q_0$ is a function of $\gamma$.
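The block structure of the output claimed in this construction can be checked mechanically. The sketch below is an illustration under assumptions: all function names are hypothetical, keys are packed as integers with $D$ in the top $d$ bits, $E$ in the next $\log n - d$ bits and $F$ in the low $h$ bits, the set $Q$ is given as a list `q_pos` of bit indices below $h$ (most significant first), and `sorted` stands in for the sorter.

```python
def scatter(value, positions):
    """Write the bits of value (most significant first) into the given bit positions."""
    d = len(positions)
    out = 0
    for idx, pos in enumerate(positions):
        out |= ((value >> (d - 1 - idx)) & 1) << pos
    return out


def gather(word, positions):
    """Read back the integer stored in the given bit positions (inverse of scatter)."""
    val = 0
    for pos in positions:
        val = (val << 1) | ((word >> pos) & 1)
    return val


def build_blocks(log_n, h, q_pos, perms):
    """Build the input class of the proof and return (x, sorted x).

    perms[i] is the permutation pi_i of range(l), with l = 2**len(q_pos).
    """
    d = len(q_pos)
    l = 1 << d
    a = (1 << log_n) // l
    e_bits = log_n - d
    x = []
    for i in range(a):                     # (i+1)-st input block
        for row in range(l):
            d_field = perms[i][row] << (e_bits + h)
            e_field = i << h
            x.append(d_field | e_field | scatter(row, q_pos))
    return x, sorted(x)
```

In output block $s$, the row with $E$-value $i$ carries `perms[i].index(s)`, that is $\pi_i^{-1}(s)$, in the positions of $Q$, so the $Q$ columns of the output encode all $a$ permutations.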

Combining the Bounds. If we select $\gamma = 1/6$, bounds 4.42 and 4.43 become

$$I \ge \max\left(n q_0/6,\ n(h - q_0)/6\right) \ge nh/12. \tag{4.46}$$

(A slightly larger constant than $1/12$ is obtained if we optimize the choice of $\gamma$, which yields
$\gamma = (\sqrt{2}-1)/2$ and $I \ge nh\,(1 - \sqrt{2}/2)^2$.) □
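The constants quoted here can be confirmed numerically by taking, for a fixed $\gamma$, the worst case over $q_0$ of the larger of bounds 4.42 and 4.43. The sketch is illustrative only: the function name `combined_constant` is hypothetical, $q_0/h$ is treated as a real number in $[0,1]$, and the minimum is taken over a uniform grid rather than in closed form.

```python
import math


def combined_constant(gamma, steps=10_000):
    """Constant c such that I >= c*n*h follows from bounds 4.42 and 4.43.

    For x = q0/h in [0, 1], the primary bound gives gamma*x and the secondary
    bound gives (1/2 - gamma)*(1 - x)/2; an adversary picks the worst x.
    """
    grid = (i / steps for i in range(steps + 1))
    return min(max(gamma * x, (0.5 - gamma) * (1 - x) / 2) for x in grid)


c_simple = combined_constant(1 / 6)                  # the choice used in the proof
c_best = combined_constant((math.sqrt(2) - 1) / 2)   # the optimized choice
```

The first value reproduces the constant $1/12$, and the second agrees with $(1-\sqrt{2}/2)^2$ up to the grid resolution.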

If we combine Inequality 4.46 with the bound in Equation 1.27 on the performance of boundary
chips (in our case $M = \Theta(n\log n)$) we obtain the following result.

Theorem 4.15. Any VLSI $(n, \log n + h)$-sorter, with $0 < h < \log n$, and with all its I/O ports on
the boundary, satisfies the bound

$$AT^2 = \Omega(n^2 h\log n). \tag{4.47}$$

We end this section on medium-length keys with some results on minimum area and on
minimum computation time. The result on the area (actually generalized to multilective I/O protocols),
as well as the one on the $AT^2$ measure of Theorem 4.14, have been independently derived by [Sg84b],
with a different approach.

Theorem 4.16. The area of any VLSI $(n, \log n + h)$-sorter, with $0 < h < \log n$, satisfies the bound

$$A = \Omega(nh). \tag{4.48}$$

Proof. Due to the functional dependence of the variables in $Y^j$ on the variables in $X^{j'}$, with
$j' \ge j$, and to the time-determinate property of the I/O protocol, there is a time $t^*$ such that
all the components of $X^{k-1},\ldots,X^h$ are input not later than $t^*$, and all the components of
$Y^{h-1},\ldots,Y^0$ are output after $t^*$.

Now, let us consider the same class of input instances as in Theorem 4.14, which is also illustrated
in Figure 4.5a. At time $t^*$ all entries of array $D$ have been already input, and no entry of $Q$ has
been output yet. However, $Q$ is an equivalent encoding of $D$, and hence
$\Omega(a\,l\log l) = \Omega(nh)$ bits (taking $d = h$) that represent $D$ must be stored by the system
at time $t^*$. □

Simple fan-in arguments allow us to prove the following result.

Theorem 4.17. The computation time of any VLSI $(n, \log n + h)$-sorter, with $0 < h < \log n$,
satisfies the bound

$$T = \Omega(\log n). \tag{4.49}$$

4.2.3 Long Keys

As  we  have  anticipated  in  the  preceding  section,  for  k  >  2  logn,  bipartition  techniques  are  not 
very  useful,  because  there  are  input-balanced  protocols  for  which  the  information  exchange  does  not 
exceed  nlogn,  regardless  of  k. 

However, we intuitively expect the area-time complexity of the $(n,k)$-sorting problem to be
increasing with $k$, for fixed $n$. For example, the trivial I/O bound tells us that $AT = \Omega(kn)$.
But we know more; in fact, a sorter of $n$ keys of length $k$ is trivially a cyclic shifter of $n$ words
of length $k - \log n$ (the least significant part of the keys), and hence it satisfies the bound
$AT^2 = \Omega((k - \log n)\,n^2)$, according to Theorem 4.4.

This bound can be further improved by taking into account the fact that a suitable choice of the
$\log n$ leading bit positions of the input keys of a sorter can force at the output an arbitrary
permutation of the keys (not only the cyclic shifts).


Theorem 4.18. Any VLSI $(n,k)$-sorter, with $k \ge 2\log n$, satisfies the bound

$$AT^2 = \Omega\left(\frac{k}{\log n}\,(n\log n)^2\right). \tag{4.50}$$

Proof. To simplify some details of the proof we assume $k \ge 3\log n$. (For
$2\log n \le k < 3\log n$ the result is a simple consequence of Theorem 4.14, in any case.) Since an
appropriate selection of the $\log n$ leading bit positions of the input produces an arbitrary cyclic
shift of the remaining positions, we can use some of the results derived in Section 4.1. Let us first
recall some notations. We denote by $b_j$ [respectively $c_j$] the number of input [respectively output]
keys whose $j$-th bit is input [output] by $P_1$. Then, for given $\gamma \in [0,1/2]$,
$Q_1 = \{j : j < k - \log n,\ c_j > (1-\gamma)n\}$,
$Q_2 = \{j : j < k - \log n,\ c_j < \gamma n\}$, and $q_i = |Q_i|$, $i = 1,2$. Moreover

$$B = \sum_{j=0}^{k-\log n-1} b_j, \qquad\text{and}\qquad B_i = \sum_{j \in Q_i} b_j, \quad i = 1,2.$$

The plan of the proof is to show that any I/O assignment in which $P_1$ reads $B = n\log n$ bits that
belong to positions $X^{k-\log n-1},\ldots,X^0$ requires an information exchange of
$I = \Omega(n\log n)$ bits. Equation 4.50 will then follow from Theorem 3.6 with
$\mathcal{V} = \{X_i^j : 0 \le i \le n-1,\ 0 \le j \le k - \log n - 1\}$,
$M = |\mathcal{V}| = (k - \log n)n$, $m = B = n\log n$, and $I = \Omega(n\log n)$.


Primary Flow. By applying Inequality 4.9, and considering that $B_1 \le n q_1$ and that in our case $B = n \log n$, we obtain

$I \ge \gamma (n \log n - n q_1)$. (4.51)

If we reverse the roles of $P_1$ and $P_2$, and we consider that $P_2$ reads $(k - 2 \log n)\, n \ge n \log n$ bits of $V$, then Inequality 4.9 yields

$I \ge \gamma (n \log n - n q_2)$. (4.52)

Secondary Flow. Let $P_s$ ($s = 1, 2$) be the processor that reads the majority of the bits that belong to the $\log n$ leading bit positions (break a tie arbitrarily). We will then show that the secondary flow into the other processor $P_i$ ($i \ne s$) increases with $q_i$. To be specific we will assume that $s = 2$, and we will bound the secondary flow from $P_2$ to $P_1$, which we finally combine with Eq. 4.51. (If $s = 1$, we can argue in a similar fashion resorting to Eq. 4.52.) After selecting arbitrarily a set $Q^*$ of $(\log n - q_1)$ bit positions of significance less than $(k - \log n)$ and not in $Q_1$, we define the set

$Q = Q_1 \cup Q^*$.

We then consider the following class of input instances. We set

(1) The leading $\log n$ bits of $X_i$ to the value $\pi(i)$, where $\pi$ is a permutation of $0, \ldots, n-1$.

(2) The $\log n$ bits of $X_i$ which belong to positions in $Q$ to the value $i$.

(3) All remaining bits to the value zero.

Then, the output array Y has the following structure.

(1) The leading $\log n$ bits of $Y_i$ represent integer $i$.

(2) The $\log n$ bits of $Y_i$ which belong to positions in $Q$ represent integer $\pi^{-1}(i)$.

(3) All remaining bits are zero.

Thus, $\pi$ can be recovered from the output positions $\{ Y^j : j \in Q \}$. Since $P_1$ outputs at least $q_1 (1 - \gamma) n$ bits of these positions, and it reads at most $\frac{1}{2} n \log n$ bits among those that specify $\pi$,

$I \ge (1 - \gamma)\, q_1 n - \frac{1}{2}\, n \log n$ (4.53)

bits of information on $\pi$ have to be communicated by $P_2$ to $P_1$.

Combining the Bounds. If we multiply both sides of bound 4.51 by $(1 - \gamma)$, and both sides of bound 4.53 by $\gamma$, and we sum the sides of the resulting bounds, we obtain

$I \ge \gamma (1/2 - \gamma)\, n \log n$. (4.54)

For $\gamma = 1/4$, $I \ge n \log n / 16$. Then, by Theorem 3.6 we have completed the proof. □
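The last step can be spelled out term by term. The display below is only a restatement of the combination just described, using the primary-flow bound 4.51 and the secondary-flow bound on the information communicated by $P_2$ to $P_1$; it introduces no new assumption.

```latex
\begin{align*}
(1-\gamma)\, I &\ge \gamma (1-\gamma)\,(n \log n - n q_1), \\
\gamma\, I &\ge \gamma (1-\gamma)\, q_1 n - \tfrac{\gamma}{2}\, n \log n, \\
I &\ge \gamma (1-\gamma)\, n \log n - \tfrac{\gamma}{2}\, n \log n
   = \gamma \left( \tfrac{1}{2} - \gamma \right) n \log n .
\end{align*}
```

The terms in $q_1$ cancel, so the resulting bound does not depend on how the output positions are split between the two processors. Setting $\gamma = 1/4$ gives $I \ge n \log n / 16$.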

We now derive an AT bound for sorting of long keys, using saturation techniques.

Theorem 4.19. Any VLSI (n,k)-sorter, with $k \ge 2 \log n$, satisfies the bound

$AT = \Omega(k n \sqrt{n \log n})$. (4.55)

Proof. In this proof we introduce several parameters whose values will be specified later to optimize the lower bound. The reader may find it useful to assume, in following the argument, that $\gamma = 1/12$, $\sigma = 5/24$, $\epsilon = 1/4$, $\xi = 1$ and $\theta = 3/8$. Although suboptimal, this choice of the parameters will give the right feeling for their range, and will also simplify the arithmetic.

We plan to lower bound the information exchange with bounded storage $I(m \mid s, \infty)$, and then apply Theorem 3.7. We consider the class of I/O assignments such that exactly m of the variables in $V = \{ X_i^j : 0 \le i \le n - 1,\ 0 \le j \le k - \log n - 1 \}$ are input by $P_1$. As usual, we denote by $c_j$ the number of bits of position $Y^j$ which are output by $P_1$, and, for $0 < \gamma \le 1/2$, we let $Q_1 = \{ j : c_j > (1 - \gamma) n,\ 0 \le j \le k - \log n - 1 \}$, and $q_1 = |Q_1|$. As we have repeatedly seen, a trivial transformation of cyclic shift yields the following bound to the information exchange (under unbounded storage):

$I \ge \gamma (m - n q_1)$. (4.56)

Obviously, this bound holds a fortiori when the storage of $P_1$ is bounded, but we need to combine it with other bounds in order to obtain the desired result. The following observations provide some insight on how a bound on the storage may affect the information exchange.

(a) At the time when the last bit of a given position is input, no bits of that position have been output. Therefore n bits are stored in the system ($P_1$ and $P_2$) at that time.

(b) If during the time interval $[t_1, t_2]$ the system outputs p bits belonging to $\lambda$ positions in set $Q_1$, then at least $p - \lambda \gamma n$ of those bits are output by $P_1$. (In fact, from the definition of $Q_1$, at most $\gamma n$ bits per position are output by $P_2$.)

(c) If, at a given time t, p bits that have to be output by $P_1$ are stored in the system, and $P_1$ has a storage bound of s bits, then at least $p - s$ bits are stored in $P_2$ at time t, and they will eventually be sent to $P_1$, thus contributing an amount $p - s$ to the information exchange.

(d) If during the time interval $[t_1, t_2]$ $P_1$ outputs q bits which belong to a set $Q' \subseteq Q_1$ of $\lambda \le \log n$ positions, then at least $(q - s)$ bits are transferred from $P_2$ to $P_1$ during the same interval, for an appropriate class of problem instances. The idea is that the outputs of $P_1$ carry q bits of information on the sorting permutation, and at most s of them could have been in $P_1$ at time $t_1$. The details of the argument are similar to those of the proof of Theorem 4.17. We need to set the $\log n$ leading bits of $X_i$ to represent $\pi(i)$ (where $\pi(0), \ldots, \pi(n-1)$ is a permutation of $(0, \ldots, n-1)$). We also augment $Q'$ to $Q^*$ by adding $(\log n - \lambda)$ arbitrary positions, and we set the $\log n$ bits of $X_i$ that belong to $Q^*$ to the value $i$. Then the output positions of $Q^*$ will be $\pi^{-1}(0), \ldots, \pi^{-1}(n-1)$, where $\pi^{-1}$ is the inverse of $\pi$, and q bits of $\pi^{-1}$ are output by $P_1$.


In order to exploit the preceding observations systematically in the analysis of information exchange, we need to define several quantities. We begin by decomposing the interval $[0, T]$ during which the computation takes place, according to the I/O protocol, which is assumed to be place-determinate and time-determinate.

If we focus on a given position j, we see that the variables of $X^j$ are generally input at different times. We are particularly interested in the time when the last bit(s) of a given position are input. For our proof, we need to consider, for each time t, the number $\lambda(t)$ of positions in $Q_1$ whose last bit(s) are input exactly at time t. (For example, $\lambda(t)$ is zero when no bit is input at time t, but also when all the bits input at time t belong to positions for which some bits remain to be input.) We will treat separately the instants when $\lambda(t)$ is large from the instants when $\lambda(t)$ is small. In fact, in the first case we can immediately see that there must be a large saturation and secondary flow.

Formally, for a given $\epsilon$ ($0 < \epsilon < 1$), we distinguish the times $t'_1 < t'_2 < \cdots < t'_u$ when $\lambda(t) \ge \epsilon \log n$ from the times $t''_1 < t''_2 < \cdots < t''_v$ when $\lambda(t) < \epsilon \log n$. Since for all the $q_1$ positions of $Q_1$ the input is completed at some time, we have

$\sum_{h=1}^{u} \lambda(t'_h) + \sum_{h=1}^{v} \lambda(t''_h) = q_1$. (4.57)

We now consider separately the contribution to the information exchange due to positions whose input is completed in each of the two sequences. We assume that $P_1$ can store $s = \sigma n \log n$ bits ($0 < \sigma < 1$).

Sequence $t'$. If we apply observation (a) to each of the $\lambda(t'_h)$ positions whose input is completed exactly at time $t'_h$, we see that at least $\lambda(t'_h)\, n$ bits are in the system at this time. From observation (b), at least $(1 - \gamma)\lambda(t'_h)\, n$ of these bits have to be output by $P_1$. Finally, observation (c), with $t = t'_h$, $p = (1 - \gamma)\lambda(t'_h)\, n$ and $s = \sigma n \log n$, allows the conclusion that the bit positions we are considering contribute $[(1 - \gamma)\lambda(t'_h) - \sigma \log n]\, n$ bits to the information exchange. If we sum over all h, we obtain a global contribution

$I' \ge (1 - \gamma) \sum_{h=1}^{u} \lambda(t'_h)\, n - u\, \sigma n \log n$. (4.58)

Since $\lambda(t'_h) \ge \epsilon \log n$, we obtain

$\sum_{h=1}^{u} \lambda(t'_h) \ge u\, \epsilon \log n$,

and substituting for u in Eq. 4.58 we finally can write

$I' \ge (1 - \gamma - \sigma/\epsilon) \sum_{h=1}^{u} \lambda(t'_h)\, n$. (4.59)

Sequence $t''$. We decompose the interval $[0, T]$ into consecutive intervals $[t''_{h_i} + 1,\ t''_{h_{i+1}}]$ for $i = 0, 1, \ldots, L$. (Here the indices $h_i \in \{0, 1, \ldots, v\}$, and we have added to the sequence $t''$ two points, $t''_0 = -1$ and $t''_{v+1} = T$.) The decomposition is chosen in such a way that, for a given $\xi$ ($\epsilon < \xi \le 1$) and for $i = 0, 1, \ldots, L-1$:

$(\xi - \epsilon) \log n < \sum_{h = h_i + 1}^{h_{i+1}} \lambda(t''_h) < \xi \log n$. (4.60)

Such a decomposition always exists, since $\lambda(t''_h) < \epsilon \log n$. Moreover,

$L \ge \sum_{h=1}^{v} \lambda(t''_h) / (\xi \log n)$, (4.61)

since at most $\xi \log n$ positions complete their input in any given interval. We will evaluate the contribution to the information exchange given by each of the intervals of the decomposition. Let us focus on one specific time interval, say $[t_1, t_2]$. For a given $\theta$ such that $0 < \theta < \xi - \epsilon$, we distinguish two cases.

(i) $p < \theta n \log n$. If $p < \theta n \log n$, at time $t_2$ at least $(\xi - \epsilon - \theta)\, n \log n$ bits are in the system, and at most $\gamma \xi n \log n$ of them have to be output by $P_2$. Thus, at least $((1 - \gamma)\xi - \epsilon - \theta)\, n \log n$ bits have to be output by $P_1$, and, due to storage limitations, at least

$I''_0 \ge (\xi (1 - \gamma) - \epsilon - \theta - \sigma)\, n \log n$ (4.62)

of these bits are in $P_2$, and will eventually flow to $P_1$, contributing to the information exchange.

(ii) $p \ge \theta n \log n$. If $p \ge \theta n \log n$, at least $\theta n \log n - \gamma \xi n \log n$ of these bits are output by $P_1$, and observation (d) (with $q = (\theta - \gamma \xi)\, n \log n$, and $(\xi - \epsilon) \log n < \lambda < \xi \log n$) allows the conclusion that in the interval $[t_1, t_2]$ there is a contribution to the information exchange

$I''_1 \ge (\theta - \gamma \xi - \sigma)\, n \log n$. (4.63)

Now, if we choose $\theta = (\xi - \epsilon)/2$, then $I''_0 = I''_1$, and in either case the interval $[t_1, t_2]$ contributes

$I''_0 = I''_1 = ((\xi - \epsilon)/2 - \gamma \xi - \sigma)\, n \log n$ bits. (4.64)

Recalling Ineq. 4.61, we see that the global contribution of the L intervals of the decomposition is

$I'' \ge ((1 - \epsilon/\xi)/2 - \gamma - \sigma/\xi) \sum_{h=1}^{v} \lambda(t''_h)\, n$. (4.65)

If we choose $\xi = \epsilon (\sigma/\epsilon + 1/2)/(\sigma/\epsilon - 1/2)$, the coefficients of bounds 4.59 and 4.65 become equal, and by summing the contributions of the sequences $t'$ and $t''$, we obtain

$I \ge (1 - \gamma - \sigma/\epsilon)\, n q_1$, (4.66)

where we have used Eq. 4.57. A linear combination of bounds 4.56 and 4.66 with coefficients $(1 - \gamma - \sigma/\epsilon)$ and $\gamma$ respectively, yields

$I \ge \gamma (1 - \gamma/(1 - \sigma/\epsilon))\, m$. (4.67)

We now choose $\gamma = (1 - \sigma/\epsilon)/2$ to maximize the right hand side, so that 4.67 becomes

$I(m \mid \sigma n \log n, \infty) \ge \frac{1}{4} (1 - \sigma/\epsilon)\, m$, (4.68)

where we have used the appropriate notation for the information exchange. At this point we are ready to apply Theorem 3.7, which states that

$T \ge I(M l^2 / A \mid l^2, \infty) / (4 l)$. (4.69)


In our case $M = (k - \log n)\, n$ and $l = \sqrt{\sigma n \log n}$, and we can rearrange bounds 4.69 and 4.68 as

$AT \ge \frac{1}{16} \left( 1 - \frac{\sigma}{\epsilon} \right) \sqrt{\sigma}\, (k - \log n)\, n \sqrt{n \log n}$. (4.70)

For a given $\sigma$, the best bound is obtained by maximizing $\epsilon$. But $\epsilon$ is subject to the constraint $\xi = \epsilon (\sigma/\epsilon + 1/2)/(\sigma/\epsilon - 1/2) \le 1$, so that $\sigma/\epsilon \ge (1 + \epsilon)/(2(1 - \epsilon))$. Under this constraint the lower bound on AT is maximized by the choice $\epsilon = 1/\sqrt{12}$, and $\sigma = (\sqrt{12} + 1)/(2(12 - \sqrt{12}))$. □
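The relations among the parameters of the proof can be checked with exact rational arithmetic. The following is a minimal sketch added for illustration, not part of the original argument: the function names are hypothetical, and it assumes $\sigma/\epsilon > 1/2$ so that $\xi$ is well defined. It recomputes $\gamma$, $\xi$ and $\theta$ from $\sigma$ and $\epsilon$ using the choices made in the proof, and evaluates the coefficients of bounds 4.59 and 4.65.

```python
from fractions import Fraction


def derived_parameters(sigma, eps):
    """Return (gamma, xi, theta) from sigma and eps, following the choices in the proof.

    Assumes sigma / eps > 1/2 (otherwise xi is undefined or negative).
    """
    ratio = sigma / eps
    xi = eps * (ratio + Fraction(1, 2)) / (ratio - Fraction(1, 2))
    theta = (xi - eps) / 2
    gamma = (1 - ratio) / 2
    return gamma, xi, theta


def coefficients(sigma, eps):
    """Coefficients multiplying the sums in bound 4.59 and in bound 4.65."""
    gamma, xi, _ = derived_parameters(sigma, eps)
    first = 1 - gamma - sigma / eps
    second = (1 - eps / xi) / 2 - gamma - sigma / xi
    return first, second
```

With the illustrative values suggested at the beginning of the proof, `derived_parameters(Fraction(5, 24), Fraction(1, 4))` returns $\gamma = 1/12$, $\xi = 1$ and $\theta = 3/8$, and the two coefficients coincide, as the choice of $\xi$ requires.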

From Theorem 4.19 and Theorem 4.18 we know that there exist constants $\beta_1$ and $\beta_2$ such that the performance of any (n,k)-sorter, with $k \ge 2 \log n$, satisfies the bounds $AT \ge \beta_1 k n \sqrt{n \log n}$ and $AT^2 \ge \beta_2 (k / \log n)(n \log n)^2$. These bounds coincide at time $T_0 = (\beta_2/\beta_1) \sqrt{n \log n}$. The AT bound is stronger for $T > T_0$, and the $AT^2$ bound is stronger for $T < T_0$.
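The crossover between the two bounds can be illustrated numerically. The sketch below is an illustration under stated assumptions: the constants $\beta_1$ and $\beta_2$ are not specified by the theorems, so they appear as placeholder parameters defaulting to 1, logarithms are taken in base 2, and the function names are hypothetical. Each bound is turned into the minimum area it implies at a given time T.

```python
import math


def area_lower_bounds(n, k, T, beta1=1.0, beta2=1.0):
    """Area implied at time T by the AT bound and by the AT^2 bound (long keys)."""
    log_n = math.log2(n)
    from_at = beta1 * k * n * math.sqrt(n * log_n) / T
    from_at2 = beta2 * (k / log_n) * (n * log_n) ** 2 / T ** 2
    return from_at, from_at2


def crossover_time(n, beta1=1.0, beta2=1.0):
    """Time T0 at which the two area bounds coincide."""
    return (beta2 / beta1) * math.sqrt(n * math.log2(n))
```

Evaluating `area_lower_bounds` at `crossover_time(n)` gives two equal values; for larger T the first component (from the AT bound) is the larger one, and for smaller T the second (from the $AT^2$ bound) dominates.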

We complete the discussion of this section with two simple results on the minimum area and the minimum computation time for sorters of long keys. The first result was originally proved in [L84], but it is also a trivial corollary of Theorem 4.16. The second result follows, as usual, by simple fan-in considerations.

Theorem 4.20. The area of any VLSI (n,k)-sorter, with $k \ge 2 \log n$, satisfies the bound

$A = \Omega(n \log n)$. (4.71)

Theorem 4.21. The computation time of any VLSI (n,k)-sorter, with $k \ge 2 \log n$, satisfies the bound

$T = \Omega(\log n + \log k)$. (4.72)

Remark. Bound 4.72 is indeed satisfied by all sorters, but the dependence on k becomes relevant only for very long words; i.e., when $\log k = \Omega(\log n)$. We conclude Section 4.2 summarizing the main results in Table 4.1.

4.3 AREA-TIME LOWER BOUNDS FOR THE COMPARATOR-EXCHANGER

Usually comparison-exchange is formulated as a problem whose input consists of two keys $X_0$ and $X_1$, and whose output consists of two keys $Y_0$ and $Y_1$ such that

$Y_0 = \min(X_0, X_1)$, (4.73)

$Y_1 = \max(X_0, X_1)$. (4.74)

Comparison-exchange is an interesting operation in its own right, and it is a primitive of many sorting algorithms. However, our main motivation to study its area-time complexity comes from the fact that comparison-exchange is indeed the (2,k)-sorting problem, and its analysis will provide us with useful insight into the phenomena that determine the complexity of sorting when the length k of the keys is very large with respect to their number n.


TABLE 4.1. SUMMARY OF LOWER BOUNDS FOR (n,k)-SORTING

                                    Length of the keys
Technique                           k <= log n  (r = 2^k)                  log n < k < 2 log n    k >= 2 log n

Bipartition +                       $AT^2 = \Omega(r^2 \log^2(1 + n/r))$   AT 2 = G(n2/i2)        $AT^2 = \Omega(n^2 \log^2 n)$
Information Exchange

Square Tessellation +               $AT^2 = \Omega(nr)$                                           $AT^2 = \Omega(nk\,(n \log n))$
Information Exchange

Square Tessellation +               $AT = \Omega(n \sqrt{r})$              -                      $AT = \Omega(nk \sqrt{n \log n})$
Saturated Information Exchange

Storage                             $A = \Omega(r \log(1 + n/r))$                                 $A = \Omega(n \log n)$

Bounded Fan-in                      $T = \Omega(\log n)$                   $T = \Omega(\log n)$   $T = \Omega(\log n + \log k)$

The  lower  bound  technique  that  we  shall  adopt  is  different  from  the  ones  we  have  applied  in  the 
preceding  sections,  and  it  is  based  on  the  notion  of  functional  dependence. 

The notion of functional dependence has been introduced in the context of VLSI computation by Johnson [J80], in order to derive an area-time lower bound for binary addition, a problem similar to comparison-exchange in several respects.


Let  us  now  recall  the  formal  definition  of  functional  dependence. 

Definition. Given a function $y = f(x)$, where $x = (x_1, \ldots, x_s)$ and $y = (y_1, \ldots, y_t)$ are boolean vectors, we say that $y_j$ is functionally dependent on $x_i$ if there exist two boolean vectors $x'$ and $x''$ that differ only in the i-th component, such that $y' = f(x')$ and $y'' = f(x'')$ differ in the j-th component.

Example. In the comparison-exchange problem, $Y_t^{j_0}$ is functionally dependent on $X_i^j$ for any $j \ge j_0$ and $i, t = 0, 1$; $Y_t^{j_0}$ is not functionally dependent on $X_i^j$ for any $j < j_0$ and $i, t = 0, 1$.
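The example can be checked exhaustively for small key lengths. The sketch below was added for this purpose and is not part of the original argument: the function names (`compex`, `bit`, `depends`) and the brute-force strategy are assumptions made here, and bit position 0 is taken to be the least significant bit. It applies the definition directly, flipping one input bit and looking for a change in one output bit.

```python
from itertools import product


def compex(x0, x1):
    """Comparison-exchange: return (Y0, Y1) = (min, max)."""
    return min(x0, x1), max(x0, x1)


def bit(value, j):
    """Bit of significance j of a non-negative integer."""
    return (value >> j) & 1


def depends(k, t, j0, i, j):
    """True if bit j0 of output Y_t is functionally dependent on bit j of input X_i.

    Keys are k-bit integers; all pairs of inputs differing only in that one
    input bit are tried.
    """
    for x0, x1 in product(range(2 ** k), repeat=2):
        original = [x0, x1]
        flipped = list(original)
        flipped[i] ^= 1 << j  # change only the chosen input bit
        if bit(compex(*original)[t], j0) != bit(compex(*flipped)[t], j0):
            return True
    return False
```

For 3-bit keys, `depends(3, t, j0, i, j)` agrees with the condition $j \ge j_0$ for every choice of t, i, $j_0$ and j, which is the pattern stated in the example.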

Time-Determinacy. For time-determinate protocols the functional dependence of $y_j$ on $x_i$ implies that $x_i$ must be input before $y_j$ is output, because there are input instances in which $y_j$ remains indeterminate until $x_i$ is known. However, more complex phenomena take place when we bring into the picture the assumption of bounded fan-in, which was not needed when analyzing the aspects of the computation that depend only on information exchange.

Bounded Fan-In. We explicitly assume now that in our circuits the gates that compute boolean functions have a number of input lines upper-bounded by a constant $f_i$. As is well known, this assumption implies that if an output variable y is functionally dependent on s input variables, then at least $\tau \log_{f_i} s$ time must elapse between the instant when the first of the input variables is read and the instant when y is output, where $\tau$ is the minimum delay of a boolean gate and $f_i$ is the maximum fan-in. Hereafter, since the values of $\tau$ and $f_i$ affect only constant factors, we assume for simplicity that $\tau = 1$ and $f_i = 2$.
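The counting argument behind the fan-in bound can be sketched in a few lines. This is an illustration only, assuming unit gate delay ($\tau = 1$); the function name is hypothetical. With gates of bounded fan-in, an output can depend on at most fan_in^levels inputs after a given number of gate levels, so the smallest number of levels that reaches s inputs is a lower bound on the elapsed time.

```python
def min_gate_levels(s, fan_in=2):
    """Fewest levels of gates with the given fan-in whose output can depend on s inputs."""
    levels, reachable = 0, 1
    while reachable < s:
        reachable *= fan_in  # one more level multiplies the reachable inputs by fan_in
        levels += 1
    return levels
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of the fan-in are handled without rounding error.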

Computational Friction. Although the previous considerations are often useful to bound the computation time of some problem, they do not exhaust all the consequences of functional dependence. In fact, if s variables $x_1, \ldots, x_s$ are input at the same time, and if there happen to exist s output variables $y_1, \ldots, y_s$ such that not only each $y_j$ depends on all $x_i$'s but also the y's carry I bits of information on the x's, the system must be capable of storing I bits for at least $\log s$ time steps. Thus, if we make an analogy in which the information is viewed as a fluid flowing from input ports to output ports, we can say that the functional dependence acts as a kind of friction that slows down the flow, keeping it below capacity.


In a VLSI system the I/O capacity is determined by the area, and implies the trivial I/O bound $AT = \Omega(\text{input size} + \text{output size})$, already discussed in Section 3.1. When functional dependence plays a role in slowing down the I/O information flow, we intuitively expect the AT measure to satisfy a stronger lower bound.

This is indeed the case for the comparator-exchanger, as shown in the following theorem.

Theorem 4.22. Any comparator-exchanger of keys of length k satisfies the bound

$A = \Omega((k/T) \log(k/T))$, (4.75)

which can also be rewritten as

$AT / \log A = \Omega(k)$. (4.76)

Proof. For $t = 1, 2, \ldots, T$, let $S(t)$ be the set of bits of $X_0$ and $X_1$ that are input exactly at time t, and let $s(t) = |S(t)|$. We partition $S(t)$ into two subsets $S_0(t)$ and $S_1(t)$ of equal size $s(t)/2$, such that the significance of the bits of $S_1(t)$ is not lower than the significance of the bits of $S_0(t)$. We consider now the set $C_0(t)$ containing all the output bits that belong to a position j such that at least one of $X_0^j$ and $X_1^j$ is input exactly at time t and belongs to $S_0(t)$. Formally,

$C_0(t) = \{ Y_0^j : X_0^j \in S_0(t) \text{ or } X_1^j \in S_0(t) \} \cup \{ Y_1^j : X_0^j \in S_0(t) \text{ or } X_1^j \in S_0(t) \}$.

On set $C_0(t)$ we can make two important observations:

(i) All variables in $C_0(t)$ are functionally dependent on all variables in $S_1(t)$. Therefore, no variable in $C_0(t)$ can be output before time $t + \log(s(t)/2)$.

(ii) From the value of the variables in $C_0(t)$, possibly with the addition of one extra bit specifying whether $X_0$ or $X_1$ is the smaller key, we can uniquely reconstruct the value of the variables in $S_0(t)$. Therefore, from time t to the time when the first variable of $C_0(t)$ is output, at least $s(t)/2$ bits of information concerning $S_0(t)$ must be stored by the system.


Combining these two observations, we conclude that the $s(t)$ variables input exactly at time t give a contribution of at least $(s(t)/2) \log(s(t)/2)$ bit × time units to the storage × time product, and hence to the AT product. Thus,

$AT \ge \sum_{t=1}^{T} (s(t)/2) \log(s(t)/2)$, (4.77)

where obviously $\sum_{t=1}^{T} s(t) = 2k$. Under this constraint, the right hand side of 4.77 is minimized when $s(t) = 2k/T$ for each $t = 1, \ldots, T$. Thus,

$AT \ge k \log(k/T)$, (4.78)

which proves 4.75. This can also be rewritten as 4.76 after simple algebraic manipulations. □
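The minimization step relies on the convexity of $x \log x$: among all input schedules with the same total number of bits, the uniform one gives the smallest sum. The sketch below is a numerical illustration only; the function names are hypothetical, logarithms are taken in base 2, and time steps at which no bit is read are taken to contribute nothing.

```python
import math


def storage_time_product(schedule):
    """Sum over time steps of (s/2) * log2(s/2), where s bits are read at that step."""
    return sum((s / 2) * math.log2(s / 2) for s in schedule if s > 0)


def uniform_bound(k, T):
    """Value k * log2(k / T) attained by the uniform schedule s(t) = 2k / T."""
    return k * math.log2(k / T)
```

For k = 64 and T = 8, the uniform schedule of 16 bits per step gives exactly `uniform_bound(64, 8)`, while uneven schedules with the same total of 2k = 128 bits give larger values.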

It is worth comparing Theorem 4.22 with Johnson's work [J80], from which we have borrowed the main idea, in order to clarify two superficial differences. First, although the area-time complexity of binary addition is exactly the same as the complexity of comparison-exchange, [J80] states the lower bounds in a form different from (and probably less clear than) Equations 4.75 and 4.76, in an attempt to formulate the results as a bound on a measure of the $AT^{\alpha}$ type. Second, while in our proof we bound essentially the amount of time that the input information must spend inside the system, Johnson bounds from below the duration of intervals during which a given amount of information is output by the system. However, this difference is only superficial, because it is obvious that when the storage has been saturated by inputs on which the system is still performing some computation, both the input flow and the output flow must necessarily slow down.


CHAPTER  5 


ALGORITHMS  AND  ARCHITECTURES 

5.1  INTRODUCTION 

In Part II of this thesis we turn our attention to the design of VLSI sorting circuits. A VLSI design has two fundamental aspects: the algorithm and the architecture. Both aspects have been extensively investigated by many researchers, and the valuable knowledge that has been accumulated is very useful for our study of VLSI sorting.

In Chapter 5 we review known parallel sorting algorithms and known parallel architectures that will be basic ingredients of our sorters. An effort has been made to give a unified presentation of the subject, but there is no attempt to make an exhaustive survey. Only algorithms and architectures that we actually use in subsequent chapters are described.

In Chapter 6 we propose a variety of optimal sorters for keys of length $k = \log n + \Theta(\log n)$. The designs are all based on previously known algorithms or simple variants thereof, and their novelty consists in the development of the appropriate architectures amenable to compact layouts.

In  Chapter  7  we  consider  arbitrary  key  lengths  and  propose  optimal  or  near-optimal  designs. 
Several  algorithms  are  new,  and  in  fact  it  can  be  proven  that  for  certain  ranges  of  key  lengths  none  of 
the  classical  sorting  algorithms  can  achieve  area-time  optimality. 

The performance of the proposed designs is contrasted with the appropriate lower bounds. However, the presentational subdivision into lower and upper bounds, which considerably simplifies the exposition, does not reflect the real development of the problem analysis. An attempt to better relate lower bounds and upper bounds is made in Chapter 8, where the main results of the entire thesis are summarized and compared with each other.
5.2 PARALLEL ALGORITHMS FOR SORTING

In this section we review some known algorithms for parallel sorting, which will be implemented in the design of VLSI sorters.

5.2.1  The  Combination  Scheme 

Several  sorting  algorithms  can  be  viewed  as  particular  cases  of  a  rather  general  scheme,  which  we 
now  describe. 

We call combination the operation that produces from m sorted sequences of l elements each one sorted sequence of ml elements. A network implementing this operation is called an (m,l)-combiner. When $m = 2$, combination reduces to merging.

Given $n = m_1 m_2 \cdots m_d$ elements, we can sort them in d stages according to the following scheme that we call combine-sort.

At stage 1 we perform $n/m_1$ combination operations, each on $m_1$ sequences of 1 element each. At stage 2 we perform $n/(m_1 m_2)$ combinations, each on $m_2$ sequences of $m_1$ elements each, and at stage i we perform $n/(m_1 \cdots m_i)$ combinations, each on $m_i$ sequences of length $m_1 \cdots m_{i-1}$. Finally, at stage d we combine $m_d$ sequences of length $n/m_d$ into one sequence of length n, which is the output of the combine-sort scheme. A diagrammatic illustration of the scheme is given in Figure 5.1 in the form of a rooted tree. Each node of this tree is a suitable combiner. An $(m_i, l_{i-1})$-combiner, $1 \le i \le d$, performs the combination of $m_i$ (sorted) sequences of length $l_{i-1}$, where $l_0 = 1$ and $l_{i-1} = m_1 m_2 \cdots m_{i-1}$ for $i > 1$. Note that each level of the tree corresponds to a stage of the combination scheme, and that there are $n/l_i$ nodes at level i, $1 \le i \le d$.

Several known sorting algorithms can be cast in the combine-sort scheme. Each algorithm is characterized by a particular factorization $n = m_1 m_2 \cdots m_d$ (note that the order of the factors is relevant here), and by the specification of how the combination is to be performed.
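The stage structure of combine-sort can be sketched sequentially. The code below is an illustration under stated assumptions, not one of the designs of this thesis: the function names are hypothetical, and the combiner is a stand-in built on the standard-library `heapq.merge`, whereas the scheme leaves the choice of combiner open.

```python
import heapq
import math


def combine(sequences):
    """Stand-in (m, l)-combiner: merge m sorted sequences into one sorted list."""
    return list(heapq.merge(*sequences))


def combine_sort(items, factors):
    """Sort items in len(factors) stages, where factors = [m_1, ..., m_d].

    The product of the factors must equal the number of items.
    """
    if math.prod(factors) != len(items):
        raise ValueError("the factors must multiply to the number of items")
    runs = [[x] for x in items]  # sorted sequences of length l_0 = 1
    for m in factors:            # stage i combines groups of m_i sequences
        runs = [combine(runs[g:g + m]) for g in range(0, len(runs), m)]
    return runs[0]
```

For instance, `combine_sort(data, [2, 3, 2])` sorts 12 elements in three stages, producing 6 sequences of length 2, then 2 sequences of length 6, then the final sequence of length 12.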

We  shall  discuss  some  important  algorithms  in  the  following  sections. 
Two important algorithms called bitonic merge and odd-even merge have been proposed by Batcher [Ba68]. Originally formulated for a network of comparators, both algorithms are also amenable to efficient VLSI implementation.

We shall make systematic use of bitonic merge, and correspondingly of bitonic sort, which exhibit a high degree of symmetry in the pattern of data interaction. Therefore, these algorithms are compactly described below, with these conventions: A[0:n-1] is the input array; d is a binary parameter specifying either increasing (d=0) or decreasing order (d=1); COMPEX(a,b;d) is a primitive operation which rearranges two numbers a and b in increasing or decreasing order depending upon the value of d. The array A[0:n-1] is sorted by a call B-SORT(A[0:n-1];0) of the following procedure (where n and b are powers of 2).(1)

procedure B-SORT(A[i:i+b-1];d)
begin if b=2 then (A[i],A[i+1]) ← COMPEX(A[i],A[i+1];d)
      else begin B-SORT(A[i:i+b/2-1];0), B-SORT(A[i+b/2:i+b-1];1);
                 B-MERGE(A[i:i+b-1];d)
           end
end

procedure B-MERGE(A[i:i+b-1];d)
begin if b=2 then (A[i],A[i+1]) ← COMPEX(A[i],A[i+1];d)
      else begin for each 0 ≤ j < b/2 pardo
                     (A[i+j],A[i+b/2+j]) ← COMPEX(A[i+j],A[i+b/2+j];d);
                 B-MERGE(A[i:i+b/2-1];d), B-MERGE(A[i+b/2:i+b-1];d)
           end
end


(1) In the following algol-like program commas are used to separate concurrent steps, and semicolons are used to separate steps to be sequentially executed.
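The two procedures can be simulated sequentially. The following Python sketch is an illustration only: the names `b_sort`, `b_merge` and `compex` are chosen here to mirror the procedures above, the concurrent steps are executed one after the other, and the block length b is assumed to be a power of 2 with b ≥ 2.

```python
def compex(a, b, d):
    """Return a and b in increasing order if d == 0, in decreasing order if d == 1."""
    lo, hi = (a, b) if a <= b else (b, a)
    return (lo, hi) if d == 0 else (hi, lo)


def b_merge(A, i, b, d):
    """Sort in place the bitonic block A[i:i+b] in the direction given by d."""
    if b == 2:
        A[i], A[i + 1] = compex(A[i], A[i + 1], d)
        return
    half = b // 2
    for j in range(half):  # concurrent in the network, sequential in this simulation
        A[i + j], A[i + half + j] = compex(A[i + j], A[i + half + j], d)
    b_merge(A, i, half, d)
    b_merge(A, i + half, half, d)


def b_sort(A, i, b, d):
    """Sort in place the block A[i:i+b] in the direction given by d."""
    if b == 2:
        A[i], A[i + 1] = compex(A[i], A[i + 1], d)
        return
    half = b // 2
    b_sort(A, i, half, 0)         # first half in increasing order
    b_sort(A, i + half, half, 1)  # second half in decreasing order: the block is now bitonic
    b_merge(A, i, b, d)
```

A whole array `A` of length n is sorted in increasing order by the call `b_sort(A, 0, len(A), 0)`.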
Odd-even merge is also a very interesting algorithm, but we do not describe it here, since we will not make direct use of it in our constructions. However, in Section 5.2.4, we shall describe the multiway-shuffle-combination algorithm, from which odd-even merge can be obtained as a special case.




5.2.3  Merge-Enumeration  Combination 

A parallel algorithm for the (m,l)-combiner, which we call merge-enumeration, has been introduced in [P78] and is based on the following ideas. The m input sequences $S_0, \ldots, S_{m-1}$ are pairwise merged to compute, for each $i, j \in \{0, 1, \ldots, m-1\}$ and each $h \in \{0, 1, \ldots, l-1\}$, the number $C_{ij}(h)$ of elements of sequence $S_j$ that are less than the h-th element of sequence $S_i$. $C_{ij}(h)$ is readily obtained as the difference of the ranks of this element in the merge of $S_i$ and $S_j$ and in $S_i$. Summing $C_{ij}(h)$ over all j gives the rank of the element in the output sequence of the combiner; thus, to complete the operation, we simply need to store each element in the position specified by its rank. The primitive operation of the scheme, the merging of two sequences, can be done, for example, by Batcher's bitonic merger.
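The rank computation can be sketched as follows. This is a sequential illustration under stated assumptions, not the parallel algorithm of [P78]: the function name is hypothetical, the keys are assumed to be distinct, and a binary search (`bisect_left` from the standard library) stands in for the pairwise merges as a way of counting how many elements of one sequence are smaller than a given key.

```python
from bisect import bisect_left


def merge_enumeration_combine(sequences):
    """Combine sorted sequences of distinct keys by storing each key at its rank."""
    total = sum(len(s) for s in sequences)
    out = [None] * total
    for s_i in sequences:
        for x in s_i:
            # Number of keys smaller than x in each sequence, summed over all sequences.
            rank = sum(bisect_left(s_j, x) for s_j in sequences)
            out[rank] = x
    return out
```

When every input sequence has length 1, the counts reduce to pairwise comparisons of keys, and the procedure sorts by enumeration.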

It can be shown [P78] that a proper implementation of merge-enumeration combination runs in time $O(\log(ml))$, that is, in time logarithmic in the size of the output sequence.

A very interesting case of the algorithm is obtained when $l = 1$, so that each of the input sequences $S_0, \ldots, S_{m-1}$ consists of just one element, and merging degenerates to comparison-exchange. In this case the combiner itself becomes a sorter, and, more specifically, the Muller-Preparata sorter originally proposed in [MP75].

In [P78] the merge-enumeration combination has been introduced to construct sorting algorithms for the shared-memory machine that run in $O(\log n)$ time and require for their execution the smallest possible number of processors. In fact, Preparata has shown that, by choosing for the combine-sort scheme the values $d = \log\log n / \log(1/(1-\alpha))$ and $m_{d-i} = n^{\alpha (1-\alpha)^i}$, with $0 < \alpha < 1$, the resulting sorting algorithms can be executed in time $O(\log n / \alpha)$ by $O(n^{1+\alpha})$ processors. The sorting scheme corresponding to a given $\alpha$ can be described as follows. The n-input sequence is split into $n^{\alpha}$ ($m_d$ in our terminology) sequences of $n^{1-\alpha}$ ($l_{d-1}$ in our terminology) elements each. These sequences are sorted recursively, and then combined by an $(m_d, l_{d-1})$-combiner. The recursion stops when sequences of length 1 are obtained. We can obtain the values for d and $m_1, \ldots, m_d$ by a simple analysis of the unfolded recursive process.

In Section 6.3 we shall explore new significant choices of d and $m_1, \ldots, m_d$ that minimize the complexity of a VLSI implementation of merge-enumeration combine-sort.

5.2.4  Multiway-Shuffle  Combination 

The  Multiway-shu fie  -combination  algorithm  has  been  introduced  by  Leighton  [L84]  (under  the 
name  of  column  sort),  and  is  a  generalization  of  the  odd-even  merge  of  Batcher  [Ba68]. 

We recall that, if N = N_1 N_2, the N_1-shuffle is defined as follows:

N_1-shuffle(0, 1, ..., N-1) =

(0, N_2, ..., (N_1-1)N_2, 1, N_2+1, ..., (N_1-1)N_2+1, ..., N_2-1, 2N_2-1, ..., N-1).

The N_1-unshuffle is defined as the inverse permutation of the N_1-shuffle. It is easily seen that, on N_1 N_2 elements, the N_1-shuffle is identical to the N_2-unshuffle. A simple way to obtain the N_1-shuffle of a sequence of N_1 N_2 elements is to write the sequence into an N_1 × N_2 array in row-major order, and read the same array in column-major order. (See Figure 5.2.)
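The array description of the shuffle can be sketched in a few lines of Python. The function names `shuffle` and `unshuffle` are ours, chosen for illustration; the code only assumes that the sequence length is a multiple of N_1.

```python
def shuffle(seq, n1):
    """N1-shuffle: write seq row-major into an n1 x n2 array, read column-major."""
    n = len(seq)
    if n % n1:
        raise ValueError("sequence length must be a multiple of n1")
    n2 = n // n1
    # Array entry (r, c) holds seq[r * n2 + c]; columns are read one by one.
    return [seq[r * n2 + c] for c in range(n2) for r in range(n1)]


def unshuffle(seq, n1):
    """Inverse permutation of shuffle(., n1)."""
    n = len(seq)
    if n % n1:
        raise ValueError("sequence length must be a multiple of n1")
    n2 = n // n1
    out = [None] * n
    for k, x in enumerate(seq):
        c, r = divmod(k, n1)      # position k was read from array entry (r, c)
        out[r * n2 + c] = x
    return out


# The 3-shuffle of twelve elements (N1 = 3, N2 = 4).
three_shuffle = shuffle(list(range(12)), 3)
```

On twelve elements, `shuffle(x, 3)` and `unshuffle(x, 4)` return the same list, which illustrates the identity between the N_1-shuffle and the N_2-unshuffle stated above.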

We are now ready to describe the multiway-shuffle combination, which is also illustrated by a block diagram in Figure 5.3. S_0, ..., S_{m-1} are the m sorted sequences of l elements each,

S_i = (s_i(0), s_i(1), ..., s_i(l-1)),   i = 0, ..., m-1,   (5.1)

and they have to be combined in the sorted sequence

S = (s(0), s(1), ..., s(ml-1)).   (5.2)

sequence: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

write in row-major:

0  1  2  3
4  5  6  7
8  9 10 11

read in column-major:
3-shuffle: (0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11)

Figure 5.2. Array definition of the multiway shuffle (N_1 = 3, N_2 = 4).

The algorithm consists of the following stages.

1. Apply a p-unshuffle to the sequence of ml elements obtained by concatenating S_0, S_1, ..., S_{m-1}. If we define the subsequences

S_{i,a} = (s_i(a), s_i(a+p), ..., s_i(a+l-p)),   a = 0, ..., p-1,   (5.3)

it is easy to see that the output of the p-unshuffle is the concatenation of

S_{0,0}, S_{1,0}, ..., S_{m-1,0}, S_{0,1}, S_{1,1}, ..., S_{m-1,1}, ..., S_{0,p-1}, S_{1,p-1}, ..., S_{m-1,p-1}.

2. For a = 0, 1, ..., p-1, recursively combine the m sequences S_{0,a}, S_{1,a}, ..., S_{m-1,a} to obtain a sorted sequence U_a.

3. Apply a p-shuffle to the concatenation of U_0, U_1, ..., U_{p-1} and call the result U.

4. For a given even integer w ≥ 2(m-1)(p-1), which divides ml, split U into ml/w "windows" of w consecutive elements and sort each window. Call the resulting sequence U'.

5. Split sequence U', except the first w/2 and the last w/2 elements, into ml/w - 1 windows of w elements, and sort each window. After this operation, we obtain S, the result of the combination.
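The five stages can be sketched sequentially in Python as follows. This is only an illustration under stated assumptions: the function name `multiway_shuffle_combine` is ours, Python's built-in `sorted` stands in for the window sorters of stages 4 and 5, and the recursion falls back to a direct sort whenever l is not a multiple of p or w does not divide ml (a base case the text does not specify).

```python
def multiway_shuffle_combine(seqs, p, w):
    """Combine m sorted lists of equal length l into one sorted list.

    p >= 2 is the shuffle parameter; w is the even window size used in
    stages 4 and 5 and must satisfy w >= 2*(m-1)*(p-1).
    """
    m, l = len(seqs), len(seqs[0])
    total = m * l
    if p < 2 or w < 2 or w % 2 or w < 2 * (m - 1) * (p - 1):
        raise ValueError("invalid choice of p or w")
    # Assumed base case (not in the text): sort directly when the
    # divisibility conditions of the stages cannot be met.
    if l < p or l % p or total % w:
        return sorted(x for s in seqs for x in s)
    # Stages 1-2: the p-unshuffle groups the subsequences S_{i,a} = seqs[i][a::p];
    # for each a they are combined recursively into the sorted list U_a.
    parts = [multiway_shuffle_combine([s[a::p] for s in seqs], p, w)
             for a in range(p)]
    # Stage 3: p-shuffle of U_0 ... U_{p-1}, i.e. U[b*p + a] = U_a[b].
    u = [parts[a][b] for b in range(total // p) for a in range(p)]
    # Stage 4: sort the ml/w aligned windows of w elements.
    for start in range(0, total, w):
        u[start:start + w] = sorted(u[start:start + w])
    # Stage 5: sort the ml/w - 1 windows shifted by w/2.
    for start in range(w // 2, total - w // 2, w):
        u[start:start + w] = sorted(u[start:start + w])
    return u
```

For m = p = 2 and w = 2 the sketch performs an odd-even merge of two sorted lists; for instance, `multiway_shuffle_combine([[1, 4, 6, 7], [2, 3, 5, 8]], 2, 2)` merges two lists of four keys.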

The basic property of the multiway-shuffle combination that justifies the correctness of the algorithm is the following.

[Figure 5.3 block diagram: a p-unshuffle of ml elements feeds p (m, l/p)-combiners, whose outputs pass through a p-shuffle of ml elements and then through two stages of sorters of w elements.]

Figure  5.3.  Block  diagram  of  multiway-shuffle  combiner.  Single  lines  carry  one  element,  double 
lines  carry  a  sequence  of  l  Kmp )  elements. 

Property. If a given input element has position h_U in sequence U and position h_S in sequence S, then

|h_U - h_S| ≤ (m-1)(p-1).

More specifically, if h_U = bp + a, for some b = 0, 1, ..., ml/p - 1, then


Remark. As we have already mentioned, the preceding algorithm is essentially equivalent to the one proposed in [L84]. The proof of the above property is also similar to the one proposed by Leighton, and it is therefore omitted here. However, our description of the algorithm is rather different from the one given in [L84], because we do not restrict ourselves to the case p = m, a case in which the unshuffling and the shuffling operations can be described as row-major to column-major transpositions of a suitable l × m array where the input sequences are originally placed.

Remark. Odd-even merge is the special case of multiway-shuffle combination obtained for m = p = 2.

A simple analysis shows that the running time of multiway-shuffle (m,l)-combination, when p = m, is T = O((log l / log m) T_w(m^2)), where T_w(m^2) is the running time of the sorting algorithm that we choose to deploy in stages 4 and 5.

A combine-sort scheme based on multiway-shuffle combination, with m_1 = m_2 = ... = m_d = m (a constant independent of n), and recursively using multiway-shuffle to sort the windows in steps 4 and 5, results in a running time T = O(log^2 n). The net result is a rather cumbersome method to obtain the same performance as can be achieved by a simple merge-sort.

Nevertheless,  multiway-shuffle  combination  is  a  remarkable  algorithm,  and  it  turns  out  to  be  very 
useful  in  some  VLSI  designs. 

5.2.5 The AKS Network

Ajtai, Komlós, and Szemerédi [AKS83] recently proposed a sorting network (referred to hereafter as the AKS network) of O(n log n) comparators and O(log n) depth. Their construction is of great theoretical interest, for it shows that O(n log n) comparisons suffice to sort n elements, even under the constraint that comparisons be nonadaptively executed in O(log n) parallel stages. At present, the AKS network appears not suitable for practical implementation, due to the large value of the constants; however, improvements are conceivable that could make the network more attractive for real-world applications.

The full description of the AKS network is too complex to be reported here.

5.2.6 Summary

The notion of (m,l)-combination provides a common framework to classify several known algorithms for parallel sorting.

In a trivial sense, every sorting algorithm falls in the combine-sort scheme, since an (m,1)-combiner is, by definition, a sorter of m elements. Indeed, both the Muller-Preparata and the AKS algorithms can be viewed as combination schemes only in this trivial sense. In fact, there is no intermediate stage of these algorithms at which the input multiset is partitioned in sorted blocks.

However,  we  have  also  seen  non-trivial  examples  of  combination-sort  including  bitonic  and  odd- 
even  merge-sort,  merge-enumeration  combination,  and  multiway-shuffle  combination. 

In the algorithms we have just cited, the same method is used to perform the combinations at all stages of the combine-sort scheme. However, different methods can be used at different stages, obtaining algorithms that we could generally call "hybrid combine-sort". Hybrid algorithms are indeed useful in VLSI applications, as we shall see in forthcoming sections.

We  conclude  this  section  with  a  graphic  summary  given  in  Figure  5.4. 


5.3 PARALLEL ARCHITECTURES

A parallel algorithm is executed by a parallel architecture, which is a set of processors connected by data paths. When focussing on the interconnection pattern, the architecture can be formally viewed as a graph whose vertices correspond to processors, and whose edges correspond to data paths. Informally, we shall often refer to such graphs as networks, or computation graphs.

In the design of a VLSI system for the solution of a given computational problem, both the algorithm and the architecture can be chosen to minimize the area-time complexity.

[Figure 5.4 diagram; root node: (m, l)-Combination.]

Figure 5.4. Hierarchy of fundamental operations in parallel sorting. A solid arrow points toward a subcase obtained by specializing the parameters defining the size of the input operands. A dashed arrow points toward a subcase obtained by specifying the algorithm by which the operation is performed.


Since the solutions of different problems require different algorithms, it is a priori quite plausible that they also require different architectures. However, the experience gained by recent research in the field of VLSI computation (a representative, but by no means exhaustive, list of references is: [BK81], [BK82], [BP84a], [BP84b], [BS84], [GKT79], [Ku82], [L81a], [L83], [Ls80a], [Me83], [MP84], [NMB83], [PV80], [PV81a], [PV81b], [T80], [T83a], [T83b]) shows that in several cases algorithms for different problems can be efficiently executed by the same architecture. (For example, the radix-2 Fourier transform and bitonic merging are both efficiently executed by the shuffle-exchange network.)

A detailed analysis of these cases reveals that although the nature of the operations performed on the input data may be radically different, the pattern according to which the processors exchange data among each other is exactly the same.

Preparata [P84] proposes to classify algorithms according to paradigms, determined exclusively by their communication patterns, and to characterize the architectures in relation to the paradigms that they can execute efficiently.

When a paradigm encompasses algorithms for several useful problems, the supporting architecture can be considered of the broad-purpose type, its capabilities being intermediate between those of a special-purpose architecture, exclusively dedicated to a given task, and those of a general-purpose architecture that can execute any conceivable algorithm.

As observed in [P84], the results emerging from current research on the design of efficient VLSI systems for the solution of fundamental computational problems strongly suggest that a few powerful and highly regular architectures can be used to satisfy a majority of computational requirements.

The  study  of  the  sorting  problem  confirms  this  indication.  As  we  shall  see  in  the  next  chapters, 
all  the  known  basic  broad-purpose  architectures,  alone  or  combined  in  novel  ways,  are  instrumental  to 
obtain  VLSI  sorters  with  optimal  area-time  performance.  For  this  reason,  we  briefly  review  them  in 
the  remainder  of  this  section. 

5.3.1  The  Binary  Cube  and  its  Emulators 

The binary cube. The v-dimensional binary cube [Pe77] is a network of N = 2^v processors labelled from 0 to N-1 as P(0), P(1), ..., P(N-1), with a direct connection (called a link) between each pair of processors whose binary numberings differ in exactly one position. If we let C_j(h) be the integer obtained by complementing the coefficient of 2^j in the binary representation of integer h, then the j-th dimension of the cube is the set of edges E_j = {(h, C_j(h)) : 0 ≤ h < N}. (See Figure 5.5.)

Among the algorithms supported by the binary cube are those whose input is an array of data A[0], A[1], ..., A[N-1], with component A[i] initially loaded in processor P(i), and whose execution consists of a sequence of steps such that at a given step, only the edges of one dimension are active. A pair of processors connected by an edge of that dimension exchange their data and operate on them.


Figure 5.5. The binary cube and its dimensions (N = 8, v = 3).


Such an algorithm belongs to a paradigm that can be easily described by giving the sequence of the dimensions in the same order as they have to be activated. With this convention, we can define two important paradigms:

Ascend:  (E_0, E_1, ..., E_{v-1}),   (5.6)

Descend: (E_{v-1}, E_{v-2}, ..., E_0).   (5.7)

These paradigms have been introduced in [PV81a] and [PV81b], where the reader can also find an extensive list of problems and algorithms complying with Ascend and Descend or simple variants thereof. The recursive structure of Ascend and Descend algorithms is also elucidated in those papers.

If the operation executed at each step takes a constant amount of time, both the Ascend and the Descend algorithms are executed by the binary cube in O(v) = O(log N) time.


Although  the  cube  is  a  theoretically  fundamental  network,  it  is  not  very  attractive  for  practical 
implementations, because the number of edges per processor increases with N. This drawback of the
cube  has  naturally  suggested  the  search  for  simpler  networks  capable  of  emulating  the  cube  without 
significant  loss  in  performance,  at  least  in  the  execution  of  algorithms  that  can  be  cast  in  the  Ascend 
and  Descend  paradigms.  We  describe  now  some  of  these  emulators. 

The shuffle-exchange network. For an even integer N, the shuffle permutation is the bijective function shuffle(h) = 2h mod (N-1), for h ∈ {0, 1, ..., N-2}, and shuffle(N-1) = N-1. The shuffle-exchange is a graph with N vertices labelled from 0 to N-1, and with two kinds of edges: the exchange edges, which are bidirectional and connect vertices 2h and 2h+1, for h = 0, 1, ..., N/2-1, and the transfer edges, which are directed and go from vertex h to vertex shuffle(h), for h = 0, 1, ..., N-1. (See Figure 5.6.)

As a network of processors P(0), P(1), ..., P(N-1), the shuffle-exchange has several attractive features [St71], most of which are summarized by the fact that it can emulate, in a simple and elegant way, the Descend paradigm of the binary cube.


Figure 5.6. The shuffle-exchange graph for N = 8.


A Descend algorithm is in fact executed by the shuffle-exchange in v phases, each of which consists of a transfer step, in which processor h sends its content to processor shuffle(h) along the transfer edge (h, shuffle(h)), and an operation step, in which processors connected by an exchange edge communicate with each other and perform operations. An interesting property of the shuffle permutation, when N is a power of two, is that the binary spelling of shuffle(h) is the left cyclic shift of the spelling of h. Therefore, if N = 2^v, shuffle^j(C_{v-j}(h)) = C_0(shuffle^j(h)), which means that the items initially held by P(h) and P(C_{v-j}(h)) will reside in processors connected by an exchange edge after j executions of the transfer step, as required by the Descend paradigm on a v-dimensional cube. This proves that the shuffle-exchange emulates the cube correctly. Moreover, since the v-th power of the shuffle is the identity, after v steps all the items are back in the original processors (although they have been transformed by computation).
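The emulation argument can be checked on small instances with the following Python sketch, which runs the same Descend computation once directly on the cube and once on the shuffle-exchange. All function names are ours; the butterfly operation used at the end is just an arbitrary order-sensitive example.

```python
def shuffle_index(h, n):
    """shuffle(h) = 2h mod (n-1) for h < n-1, and shuffle(n-1) = n-1."""
    return h if h == n - 1 else (2 * h) % (n - 1)


def cube_descend(data, op):
    """Reference run: activate E_{v-1}, ..., E_0 directly on the cube."""
    a = list(data)
    n = len(a)
    v = n.bit_length() - 1
    for j in reversed(range(v)):
        for h in range(n):
            if not h & (1 << j):
                hi = h | (1 << j)
                a[h], a[hi] = op(a[h], a[hi])
    return a


def shuffle_exchange_descend(data, op):
    """Emulation: v phases, each a transfer step plus an operation step."""
    a = list(data)
    n = len(a)
    v = n.bit_length() - 1
    for _ in range(v):
        moved = [None] * n
        for h in range(n):                 # transfer step along (h, shuffle(h))
            moved[shuffle_index(h, n)] = a[h]
        a = moved
        for q in range(n // 2):            # operation step on exchange edges
            a[2 * q], a[2 * q + 1] = op(a[2 * q], a[2 * q + 1])
    return a


def butterfly(x, y):
    return (x + y, x - y)


sample = [3, 1, 4, 1, 5, 9, 2, 6]
same_result = shuffle_exchange_descend(sample, butterfly) == cube_descend(sample, butterfly)
```

Because the v-th power of the shuffle is the identity, the emulated result ends up in the original processors and can be compared position by position with the cube result.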

The inverse permutation of the shuffle, known as unshuffle, is also interesting. The unshuffle-exchange would in fact emulate the Ascend paradigm in the same way as the shuffle-exchange emulates the Descend paradigm. Indeed, the transfer edges of the shuffle-exchange are often defined as bidirectional, so that both the shuffling and the unshuffling of the data can be accomplished in one transfer step.

In this case, it is easily seen that both the Ascend and the Descend algorithms have an O(log N) running time on the shuffle-exchange network.

As to the area performance, the shuffle-exchange graph can be laid out in A = O((N/log N)^2) area, which is optimal [KLLM83].

An attractive feature of the shuffle-exchange network is the simplicity of the emulation algorithm, consisting of an alternation of transfer steps with operation steps, the transfer steps being all performed according to the same permutation, i.e., the shuffle. A natural question is whether we can obtain other emulators of the Descend paradigm by using permutations different from the shuffle. This question is answered in [BJ84], where it is shown that such permutations exist, but they are so closely related to the shuffle that there is nothing to lose in restricting our attention to the shuffle itself.

However, there are other interesting emulators of the binary cube, which use schemes to transfer the data more complex than a simple permutation.

The linear array. The linear array is a network of N processors P(0), P(1), ..., P(N-1), with a bidirectional edge between P(i-1) and P(i), for i = 1, 2, ..., N-1.

The data contained in a linear array can be easily shuffled (the content of P(h) is sent to P(shuffle(h))) or unshuffled in N/2 - 1 transfer steps (for N even), as shown in [PV81a].
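One way to meet the N/2 - 1 bound is a triangular schedule of neighbour exchanges, sketched below in Python. This is an illustrative schedule under our own assumptions; whether it coincides in every detail with the procedure of [PV81a] is not claimed here, and the function names are ours.

```python
def shuffle_index(h, n):
    """shuffle(h) = 2h mod (n-1) for h < n-1, and shuffle(n-1) = n-1."""
    return h if h == n - 1 else (2 * h) % (n - 1)


def shuffle_on_linear_array(data):
    """Shuffle an even-length array using only exchanges between neighbours.

    Returns (result, steps). In each step the exchanged pairs are disjoint,
    so a step corresponds to one parallel transfer step of the array.
    """
    a = list(data)
    n = len(a)
    if n % 2:
        raise ValueError("the array length must be even")
    half = n // 2
    steps = 0
    for t in range(1, half):
        first = half - t                   # leftmost pair exchanged in step t
        for i in range(t):
            left = first + 2 * i
            a[left], a[left + 1] = a[left + 1], a[left]
        steps += 1
    return a, steps


shuffled, used_steps = shuffle_on_linear_array(list(range(8)))
```

For N = 8 the three steps exchange the pairs at positions (3,4), then (2,3) and (4,5), then (1,2), (3,4) and (5,6). Running the same schedule in reverse order unshuffles the data.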

If N = 2^v, an operation step requiring the use of cube dimension E_j (0 ≤ j < v) can be performed by the linear array as follows (refer to Figure 5.7). The entire array is decomposed in N/2^{j+1} subarrays, each of 2^{j+1} consecutive processors. Each subarray, in parallel and independently of the others, will shuffle its data to create the correct adjacencies required by the cube dimension E_j, and to allow for the execution of the operation step. Then, the original order of the data is restored by unshuffling the subarrays. This process requires one operation step, and 2(2^j - 1) = 2^{j+1} - 2 transfer steps.

Figure 5.7. Execution of operations of cube dimension E_2 with a linear array of N = 16 processors.

Thus, the Ascend and Descend paradigms can be executed by the linear array in v operation steps and Σ_{j=0}^{v-1} (2^{j+1} - 2) < 2N transfer steps. The same result holds for any cube paradigm in which all the v dimensions are activated exactly once, in any arbitrary order. In conclusion we obtain T = O(N), and clearly A = O(N).


The rectangular mesh. The (s × t) rectangular mesh is a two-dimensional array of N = st processors P_i^j, i = 0, ..., s-1, and j = 0, 1, ..., t-1, where each row and each column is interconnected as a linear array.

Let N = 2^v, s = 2^σ, and t = 2^τ, and let the mesh be loaded with the input vector A[0], ..., A[N-1] of the Ascend paradigm in column-major order, so that processor P_i^j stores A[i + sj] (refer to Figure 5.8).


If the processors were connected as a cube, with cube processor P(i + js) corresponding to mesh processor P_i^j, dimensions E_0, E_1, ..., E_{σ-1} would remain associated with the columns, and dimensions E_σ, ..., E_{σ+τ-1} (σ + τ = v) would remain associated with the rows of the mesh (see Figure 5.8).

Figure 5.8. A (4 × 8) rectangular mesh. When the indices of the cube processors are mapped into the array in column-major order, dimensions E_0 and E_1 are associated with columns, and dimensions E_2, E_3, and E_4 are associated with rows (N = 32, s = 4, t = 8, σ = 2, τ = 3).


The operation step associated with a given dimension can then be executed by the same technique used for linear arrays. If 0 ≤ j ≤ σ-1, E_j is executed by suitable subarrays of the columns in 2^{j+1} - 2 transfer steps. If σ ≤ j ≤ σ+τ-1, E_j is executed by suitable subarrays of the rows in 2^{j-σ+1} - 2 transfer steps.

A simple analysis shows that using each dimension once takes a global number of transfer steps smaller than 2(s + t). This number is minimized by the near-square mesh (s = t = √N if v is even, or s = t/2 = √(N/2) if v is odd), which works in O(√N) steps.

This  result  obviously  applies  to  Ascend  and  Descend.  It  is  useful  to  observe  that  the  execution  of 
an  Ascend  algorithm  on  the  rectangular  mesh  can  be  viewed  as  the  execution  of  an  Ascend  algorithm 
on  the  columns  followed  by  an  Ascend  algorithm  on  the  rows.  Similarly,  a  Descend  algorithm  on  the 
mesh  consists  of  a  Descend  on  the  rows  followed  by  a  Descend  on  the  columns. 

Summarizing, a near-square mesh of N processors can execute algorithms in the Ascend and Descend paradigms in time T = O(√N). The layout area is clearly A = O(N).
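The transfer-step counts of the linear array and of the mesh can be compared numerically with a small Python sketch. It only adds up the per-dimension cost 2^{j+1} - 2 of a shuffle followed by an unshuffle on subarrays of 2^{j+1} processors; the function names are ours.

```python
def linear_array_transfer_steps(v):
    """Transfer steps needed to use each of the v cube dimensions once
    on a linear array of N = 2**v processors."""
    return sum(2 ** (j + 1) - 2 for j in range(v))


def mesh_transfer_steps(sigma, tau):
    """Same count on an s x t mesh with s = 2**sigma and t = 2**tau:
    sigma dimensions are handled inside the columns (length s) and
    tau dimensions inside the rows (length t)."""
    return linear_array_transfer_steps(sigma) + linear_array_transfer_steps(tau)


# N = 1024 processors: a linear array versus a 32 x 32 mesh.
linear_cost = linear_array_transfer_steps(10)
mesh_cost = mesh_transfer_steps(5, 5)
```

In closed form the two counts are 2N - 2 - 2v and 2(s + t) - 2v - 4, so for N = 1024 the linear array needs 2026 transfer steps, against 104 for the square mesh.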

The Cube-Connected-Cycles. Referring to Figure 5.9, an (s × t) Cube-Connected-Cycles network (CCC), with s = 2^σ, t = 2^τ, s ≥ τ, is a network of N = st = 2^v modules and can be conveniently thought of as an s × t array of processors P_i^j (0 ≤ i < s, 0 ≤ j < t) arranged as a matrix where j grows from left to right (as usual), whereas i grows from bottom to top. (Figure 5.9 illustrates a 4 × 8 CCC.) The CCC-processor P_i^j has number h = j 2^σ + i and corresponds to the cube processor P(h). It follows that in the CCC the original cube indices are arranged in column-major order. The columns of the s × t array are connected as cycles, with an edge between P_i^j and P_{(i+1) mod s}^j. The first τ rows (0 ≤ i < τ) are associated with the τ highest dimensions of the cube; specifically, row i contains an edge between each pair of processors whose numbers differ exactly in bit position v - τ + i. The dimensions of the cube are then divided into two groups: the cycle dimensions E_0, ..., E_{σ-1}, which pertain to interactions between pairs of elements in the same cycle, and the lateral dimensions E_σ, ..., E_{v-1}, which pertain to interactions between pairs of elements of the same row.

For a given N = 2^v, by choosing s in the range [log N, √N] we can achieve a performance AT^2 = O(t^2 s^2) = O(N^2) for any computation time T ∈ [Ω(log N), O(√N)].

Comparison of Cube Emulators. The performance of the emulators of the cube we have described for the Ascend and the Descend paradigms is summarized in Table 5.1. The shuffle-exchange, the mesh, and the CCC all have the same AT^2 = Θ(N^2) performance, which is optimal. In the following chapters we usually deploy cube-connected-cycles for the execution of Ascend and Descend algorithms, especially because of its area-time trade-off feature, which allows one to choose the value of the computation time from a wide range. It must also be said that, for T = O(√N), the mesh is usually preferable for the simplicity of its interconnection. For T = O(log N), the shuffle-exchange is attractive for the elegance of the emulation algorithm. However, the optimal layout of the shuffle-exchange is very irregular, which is not a desirable feature for VLSI systems.

The linear array is indeed a poor emulator of the cube, at least when judged by its area-time performance. However, it is very useful as a component of more complex networks, as we have already seen in the case of the rectangular mesh and of the CCC.


TABLE 5.1. AREA-TIME PERFORMANCE OF CUBE EMULATORS

Architecture        Time                        Area
Shuffle-Exchange    T = Θ(log N)                A = Θ(N^2/T^2)
Linear Array        T = Θ(N)                    A = Θ(N^3/T^2)
Square Mesh         T = Θ(√N)                   A = Θ(N^2/T^2)
CCC                 T ∈ [Ω(log N), O(√N)]       A = Θ(N^2/T^2)

5.3.2 The Tree and the Orthogonal Trees

The binary tree. Several computations require the N-fold replication of a given data item, or some kind of combination of N distinct data items to generate a single one. These operations are efficiently executed in Θ(log N) steps by a fully balanced binary tree with N = 2^v leaves and N-1 internal nodes. This graph can be laid out in Θ(N) area if there is no constraint on the placement of the leaves, and in Θ(N log N) area if the leaves must be placed on the boundary of the layout region [BK80].

The orthogonal trees. (Refer to Figure 5.10.) The two-dimensional orthogonal tree network (OT) [L81, NMB83] consists of N = n^2 processors P_i^j (i, j = 0, 1, ..., n-1), and 2n fully balanced binary trees CT_0, ..., CT_{n-1} (the column trees), and RT_0, ..., RT_{n-1} (the row trees). The leaves of CT_j are the processors P_0^j, ..., P_{n-1}^j, and the leaves of RT_i are the processors P_i^0, ..., P_i^{n-1}.

Figure 5.10. Orthogonal-tree network for N = 16.

Optimal layouts of the OTs have area A = O(N log^2 N) [L81]. Algorithms consisting of a constant number of replication and combination operations along the row and the column trees are executed by the OTs in time T = O(log N).

The OT network is very versatile and will be used in several of our sorters. Multidimensional OTs can also be defined. An interesting application of three-dimensional OTs to matrix multiplication can be found in [PV80].


CHAPTER  6 


OPTIMAL VLSI SORTERS
FOR KEYS OF LENGTH k = log n + Θ(log n)

6.1 INTRODUCTION

In most of the investigations on VLSI sorting the length of the keys has been assumed to be of the form k = (1 + α) log n, for some constant α > 0. Since the results of those investigations are indeed valid as long as (1 + α_1) log n ≤ k ≤ (1 + α_2) log n, for some constants α_2 > α_1 > 0, it is slightly more appropriate to refer to a length of the form k = log n + Θ(log n), not to suggest that, say, k = 2 log n + log log n is excluded by our considerations.

In retrospect, we can justify the attention given to the case k = (1 + α) log n for two reasons: (1) this case of the sorting problem is the easiest (or the least difficult) to analyze, and (2) its complete solution is instrumental to make progress on other cases. While the second reason will be substantiated only in Chapter 7, where we will show that a sorter of keys with k = (1 + α) log n is a useful building block for sorters of both short and long keys, the first reason can be already explained on the basis of the lower bound results of Part I.

In fact, while short and long keys have to be studied with the more sophisticated square-tessellation technique, the case k = (1 + α) log n, which partially overlaps with medium-length keys (α < 1) and partially with long keys (α ≥ 1), can be analyzed by the bipartition technique (although, as we have observed in Section 4.2, the dependence of the complexity on α, for α > 1, can be really understood only by the square tessellation bound).

Indeed, from Theorems 4.14, 4.15, 4.16, and the assumption k ≥ (1 + α_1) log n for some α_1 > 0, we obtain

AT^2 = Ω(n^2 log^2 n)   (6.1)

T = Ω(log n)   (6.2)

A = Ω(n log n)   (6.3)

In this chapter, we will study several VLSI sorters, the analysis of which will allow the conclusion that the optimality curve of the (n, log n + Θ(log n))-sorting problem is described by

A = Θ(n^2 log^2 n / T^2),   T ∈ [Ω(log n), O(√(n log n))].   (6.4)
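As a quick consistency check (our own arithmetic, not a statement taken from the cited theorems), substituting the two extreme values of T into the curve (6.4) gives the following areas:

```latex
\begin{align*}
T = \Theta(\log n)
  &\;\Longrightarrow\;
  A = \Theta\!\left(\frac{n^{2}\log^{2} n}{\log^{2} n}\right) = \Theta(n^{2}),\\[4pt]
T = \Theta\!\left(\sqrt{n\log n}\right)
  &\;\Longrightarrow\;
  A = \Theta\!\left(\frac{n^{2}\log^{2} n}{n\log n}\right) = \Theta(n\log n).
\end{align*}
```

The first line corresponds to the minimum time allowed by (6.2). The second line meets the area bound (6.3), which suggests why the curve is not extended to computation times larger than √(n log n): beyond that point the area cannot decrease any further.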

As we shall see, the main difficulty in obtaining optimal VLSI sorters for k = log n + Θ(log n) consists in designing the appropriate architecture. For the algorithms instead, it will be sufficient to resort to (minor adaptations of) the known results reviewed in Section 5.2. The situation will be different for short and long keys, which will require new algorithms as well as new architectures.

This fact is not without explanation. The first parallel sorting algorithms have been conceived for either the shared-memory-machine or the network-of-comparators models of computation, whose primitive operations are at the word level. Thus, the keys were treated as indivisible entities that maintain their identity throughout the entire computation.

The indivisibility of the key is a very restrictive constraint in the VLSI model of computation, and conflicts with area-time optimality.

Indeed, to sort short keys it is not convenient to maintain a list encoding of the input multiset in
the  intermediate  stages  of  the  computation,  because  it  is  very  inefficient,  and  requires  superfluous 
bandwidth  in  transmission.  Thus,  no  algorithm  that  maintains  the  identity  of  the  keys  can  achieve 
optimality. 

The same conclusion is true also for rather long keys (log n = o(k)), but for a different reason. The list representation is indeed efficient in this case, but we still need to fragment the keys to avoid a large primary flow. In fact, we have already seen that even when the "indivisibility" of keys is required only at the I/O ports (word-locality) the AT^2 complexity of (n,k)-sorting is asymptotically quadratic in k (Theorem 4.14), while without this restriction the complexity is only linear in k (Theorem 4.18).


The case k = log n + Θ(log n) is made special (and, superficially, simpler than others) by two circumstances. One is that the list encoding is optimal for this length (within a constant factor). The other is that any strategy to decrease the primary flow below Θ(n log n) by suitably decomposing the keys would fail to yield better area-time performance, due to the presence of an irreducible Θ(n log n) secondary flow.

In conclusion, for k = log n + Θ(log n) the indivisibility of the keys (word-locality) is not a drawback, and classical sorting algorithms turn out to be instrumental to obtain area-time optimal designs.

With  this  premise,  w  now  turn  our  attention  to  the  effective  construction  of  VLSI  sorters.  Many 
designs  have  been  proposed  in  the  early  literature,  and  reference  [T83]  surveys  several  of  them.  Here, 
we  recall  only  the  designs  that  come  closer  to  AT2  =  H(n  2log2n )  lower  bounds,  which  are: 

(i) a mesh-connected bitonic sorter ([Ba68], [T80], [TK77], [NS79]) with optimal performance
A = O(n^2 log^2 n / T^2) at T = Θ(√n), and four other designs, all with suboptimal performance
A = Θ(n^2 log^4 n / T^2), which are:

(ii) a shuffle-exchange bitonic sorter at T = Θ(log^3 n) ([St71], [T80], [KLLM83])

(iii) a cube-connected-cycles bitonic sorter, for T ∈ [Ω(log^3 n), O(√n log n)] ([PV81])

(iv) a pipelined Batcher's network at T = Θ(log^2 n) (VLSI estimate in [T83])

(v) an orthogonal-tree-connected ([L81], [NMB83]) Muller-Preparata sorter [MP75] at T = Θ(log n).

The Θ(log^2 n) gap between lower and upper bound for AT^2 exhibited by the last four designs has
indeed been one of the original motivations for the work reported in this thesis. The remainder of this
chapter is devoted to the discussion of VLSI sorters with optimal performance. For several of them, a
description is already available in the literature.

Section 6.2 is devoted to bitonic sorting. Our approach will consist in focussing on the underlying
paradigm, which is of the cube type, but more complex than the Ascend or the Descend paradigms, and
in constructing efficient architectures for that paradigm.

Section 6.3 is devoted to merge-enumeration sort, which makes it possible to achieve the minimum
computation time T = O(log n). New architectures are also considered, based on a mixing of orthogonal
trees and cube-connected-cycles.

Section 6.4 concludes the chapter with a brief report on other area-time optimal networks
implementing multiway-shuffle sort and the AKS algorithm.

6.2  NETWORKS  FOR  BITONIC  SORTING 
6.2.1  The  Bitonic  Sorting  Paradigm 

The bitonic sorting of n = 2^ν elements (reviewed in Section 5.1.2) consists of ν merging phases,
M_0, M_1, ..., M_{ν-1}, with phase M_i performing the merging of pairs of sequences of length 2^i.

Bitonic merging, on the other hand, complies with the Descend paradigm of the binary cube
[PV81a], so that the execution of phase M_i on the binary cube requires the successive use of dimensions
E_i, E_{i-1}, ..., E_0. Thus, the schedule of use of the dimensions for a complete sorting is the one shown
below (Figure 6.1), which will be called the bitonic sorting paradigm.


    phase        active cube dimensions

    M_0          E_0
    M_1          E_1  E_0
    M_2          E_2  E_1  E_0
    ...          ...
    M_{ν-1}      E_{ν-1}  ...  E_0

Figure 6.1. The bitonic sorting paradigm.
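
As an illustration, the schedule of Figure 6.1 can be generated mechanically. The following is only a minimal sketch, assuming that phase M_i uses dimensions E_i down to E_0 as prescribed by the Descend paradigm; the function names are hypothetical and are not part of the thesis.

```python
def bitonic_paradigm(nu):
    """Return the schedule as a list of phases.

    Phase i is the Descend sequence E_i, E_{i-1}, ..., E_0,
    represented here by the list of dimension indices.
    """
    return [list(range(i, -1, -1)) for i in range(nu)]


def usage_counts(nu):
    """Count how many times each dimension E_j is used by the whole paradigm."""
    counts = [0] * nu
    for phase in bitonic_paradigm(nu):
        for j in phase:
            counts[j] += 1
    return counts


if __name__ == "__main__":
    # Print the schedule for n = 2^4 keys, one merging phase per line.
    for i, phase in enumerate(bitonic_paradigm(4)):
        print(f"M_{i}: " + " ".join(f"E_{j}" for j in phase))
```

Under this reading, dimension E_j is used ν - j times and the whole paradigm consists of ν(ν+1)/2 cube steps, so the lowest dimensions are the most frequently used ones.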


We  shall  now  analyze  the  performance  of  some  of  the  known  emulators  of  the  cube  on  the 
bitonic  sorting  paradigm,  and  we  shall  then  propose  new  and  more  efficient  emulators. 

Some  considerations  will  put  the  problem  in  the  proper  perspective,  and  will  indicate  the 
difficulties  to  be  overcome  for  its  solution. 

The area-time performance of the emulators of the cube reported in Section 5.3.1 pertains to the
Ascend and Descend paradigms. The area is estimated under the assumption that links between
processors are realized with unit bandwidth, so that they can be laid out in unit width, and the computation
time is estimated under the assumption that both operation and transfer steps take unit (or constant)
time.

When the actual data processed by the algorithm have length k, the estimate on the AT^2 measure
must be multiplied by k^2, although we can usually still choose whether the penalty is to be paid in
area or in time. In fact, for a given b with 1 ≤ b ≤ k, if we realize all the links with bandwidth b
we obtain an area A_b = b^2 A_1 and, usually, a time T_b = (k/b) T_1, so that A_b T_b^2 = k^2 A_1 T_1^2.
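
The invariance of the AT^2 measure under the choice of bandwidth can be checked numerically. The sketch below assumes the usual case T_b = (k/b) T_1 discussed here; the function name and the sample values are hypothetical.

```python
def scaled_cost(area_1, time_1, k, b):
    """Area, time and AT^2 when every link has bandwidth b (1 <= b <= k).

    area_1 and time_1 are the unit-bandwidth estimates A_1 and T_1.
    """
    if not 1 <= b <= k:
        raise ValueError("bandwidth b must satisfy 1 <= b <= k")
    area_b = b * b * area_1        # A_b = b^2 A_1
    time_b = (k / b) * time_1      # T_b = (k/b) T_1 (the usual case)
    return area_b, time_b, area_b * time_b ** 2


if __name__ == "__main__":
    # The product A_b T_b^2 is the same for every bandwidth b.
    for b in (1, 2, 4, 8, 16):
        print(b, scaled_cost(3.0, 5.0, 16, b))
```

Whatever b is chosen, the product equals k^2 A_1 T_1^2: a larger bandwidth trades area for time without changing the AT^2 measure.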

The reason for which we cannot claim that T_b always equals (k/b) T_1 is that, although a larger
bandwidth automatically yields a proportional speed-up in transfer steps, it does not guarantee a
speed-up in operation steps. However, most of the time operation steps can be performed in k/b time, at
least as long as k/b is not small. For example, in sorting, the operation step usually involves a
comparison-exchange, which can be performed in time k/b (and area b^2) as long as k/b = Ω(log k),
or equivalently, b = O(k / log k).

Thus, the optimal emulators of the Ascend and Descend paradigms achieve a performance
AT^2 = O(N^2 k^2) on operands of length k. If N = n and k = log n + Θ(log n), AT^2 = O(n^2 log^2 n).

In order to attain the AT^2 = Ω(n^2 log^2 n) lower bound for (n, log n + Θ(log n))-sorting by
means of the bitonic algorithm, we must be able to execute the entire bitonic sorting paradigm in the
same order of time as the much simpler Descend paradigm. It is indeed surprising that this is possible.


6.2.2  The  Linear  Array

Although the linear array is far from being an optimal emulator of the cube even for the simple
Ascend and Descend paradigms, the study of its performance on the execution of bitonic sorting will
provide us with some useful insights.

With reference to the discussion of Section 5.3.1, we shall consider a linear array of N = n
processors, where n is the number of keys to be sorted. Each processor will be endowed with O(k) bits of
memory, to allow the storage of a key of length k, and with a serial comparator-exchanger that works
in O(k) time. We can then lay out a processor in a region of O(1) width and O(k) height. The
connection between P(i-1) and P(i) can be realized with bandwidth k, to allow the execution of transfer
steps in one time unit. The entire array can then be easily laid out in O(n) × O(k) area. (See Figure 6.2.)

The running time of bitonic sorting on the linear array is readily estimated. The operation steps
are (ν+1)ν/2 = O(log^2 n) in number, and each takes O(k) time, so that, globally, the
comparison-exchange steps take T_1 = O(k log^2 n) time. As for the data transfer we recall from Section 5.3.1 that
execution of E_j is done in 2^{j+1} - 2 steps. Since the bitonic sorting uses dimension E_j exactly ν - j
times, globally the transfer steps take

    T_2 ≤ Σ_{j=0}^{ν-1} (ν - j) 2^{j+1} = O(n)

time. In conclusion, for k = log n + Θ(log n), T_1 is negligible with respect to T_2, and the total
sorting time is T = O(n).
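
The O(n) estimate for the transfer steps can be verified directly. The following sketch assumes, as recalled here from Section 5.3.1, that dimension E_j costs 2^{j+1} - 2 transfer steps on the linear array and is used ν - j times; the function name is hypothetical.

```python
def linear_array_transfer_steps(nu):
    """Total transfer steps of the bitonic sorting paradigm on a linear array.

    Dimension E_j is used (nu - j) times and each use costs 2^(j+1) - 2 steps.
    """
    return sum((nu - j) * (2 ** (j + 1) - 2) for j in range(nu))


if __name__ == "__main__":
    # The ratio steps / n stays bounded as nu grows.
    for nu in range(2, 21, 2):
        n = 2 ** nu
        steps = linear_array_transfer_steps(nu)
        print(nu, steps, steps / n)
```

Since Σ_{j=0}^{ν-1} (ν - j) 2^{j+1} = 2(2^{ν+1} - ν - 2), the total is always smaller than 4n, in agreement with the O(n) bound.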


Figure 6.2. Layout of linear array for bitonic sorting.


At first, this is surprising since we would intuitively expect that the bitonic sorting paradigm,
which consists of ν(ν+1)/2 steps on the cube, would require more time than the Descend paradigm,
which consists of only ν steps on the cube. However, a closer analysis reveals that the dimensions more
frequently used by bitonic sorting are the lowest, which also happen to be the ones that require less time
on the linear array. Obviously, this is a fortunate accident, but the principle that the most frequently used
dimensions should be the ones with the fastest execution will be a useful guideline for subsequent
developments.

6.2.3  The  Mesh

Let n = 2^ν and, for simplicity, let ν be even. We consider now the execution of bitonic sorting
with a (√n × √n)-mesh. The n processors of the mesh will be equipped with a serial comparator,
and with O(k) bits of storage. They will be laid out in an O(b) × O(b) region, with √k ≤ b ≤ k, to
allow bandwidth b for both the horizontal and the vertical connections. The global layout area is then
A = O(b^2 n).

The O(log^2 n) comparison-exchange steps globally take O(k log^2 n) time, which, for
k = log n + Θ(log n), will be completely negligible with respect to the time used for transfer steps.

As we have seen in Section 5.3.1, the execution of dimensions E_j and E_{ν/2+j} uses 2^{j+1} - 2
transfer steps, for j = 0, 1, ..., ν/2 - 1. A simple calculation shows that the total number of transfer
steps is O(√n log n), and hence the computation time is of order O(√n log^2 n / b).

Summarizing, if k = log n + Θ(log n), A = O(b^2 n) and T = O(√n log^2 n / b), so that
AT^2 = O(n^2 log^4 n), which is within a factor O(log^2 n) of the lower bound. We observe that, since
b ∈ [√k, k], the computation time can be chosen in the range T ∈ [Ω(√n log n), O(√n log^{3/2} n)].

By using a more efficient implementation of bitonic sorting on the mesh, [TK77] and [NS79] have
managed to reduce the number of transfer steps to O(√n). Using this result in the above analysis we
obtain the following theorem.

Theorem 6.1. Bitonic sorting of n keys of length k = log n + Θ(log n) can be executed by a
(√n × √n)-mesh with optimal AT^2 = O(n^2 log^2 n) for T ∈ [Ω(√n), O(√(n log n))].

The  original  description  of  the  algorithm  in  [TK77]  and  [NS79]  is  rather  complex,  but  it  can  be 
simplified  by  using  an  approach  that  focusses  on  the  paradigm. 

Indeed,  in  the  next  section  we  shall  develop  a  framework  in  which  the  optimal  algorithm  for 
bitonic  sorting  on  the  mesh  can  be  easily  obtained  as  a  specialization  of  a  general  principle. 

6.2.4  Efficient  Use  of  Cube  Emulators  for  Arbitrary  Paradigms.

Any emulation procedure by which a given graph G = (V,E), with |V| = N = 2^ν, emulates the
ν-dimensional binary cube is based on a one-to-one correspondence between the vertices of G and the
vertices of the cube, such that v ∈ V corresponds to f(v) ∈ {0, 1, ..., N-1}. The emulation procedure is
correct when, if the processor associated with v ∈ V is initially loaded with the same input data A[f(v)]
that in the cube are loaded in processor P_{f(v)}, then upon termination the processor associated with v
contains the same output data A[f(v)] that the cube contains in P_{f(v)}.

We investigate now the possibility of modifying the function f for a fixed graph G. In
particular, let σ(0), ..., σ(ν-1) be a permutation of the dimensions (0, ..., ν-1), and let π(0), ..., π(N-1) be a
permutation of {0, ..., N-1} such that if h has the binary representation h_{ν-1} h_{ν-2} ... h_0, then π(h) has
the representation h_{σ(ν-1)} h_{σ(ν-2)} ... h_{σ(0)}. Thus, if h and h' are connected by an edge in E_j, then
π(h) and π(h') are connected by an edge in E_{σ^{-1}(j)}. We consider the correspondence between G and the
cube defined by f_σ(v) = π(f(v)) (see Figure 6.3).

If for the pair (G, f) there is a procedure that emulates the execution of dimension E_j in time T_j,
for the pair (G, f_σ) the same procedure will emulate the execution of dimension E_{σ^{-1}(j)} in time T_j.

Given a paradigm consisting of an arbitrary schedule of use of the cube dimensions
(E_{d_1}, E_{d_2}, ..., E_{d_l}), we can ask for which σ the pair (G, f_σ) achieves the minimum emulation time.

The answer is not difficult. Let μ(j) be the number of times that E_j is used by the paradigm (μ
is the multiplicity function of the multiset {d_i}). If p_0, p_1, ..., p_{ν-1} is the sequence of the ...

Figure 6.3. Correspondences f and f_σ between G and the cube.

... with respect to a column-major numbering, the execution times are given by

    T_j = 2^{j+1} - 2,          for j = 0, 1, ..., ν/2 - 1,
    T_j = 2^{j-ν/2+1} - 2,      for j = ν/2, ..., ν - 1.

Thus,

    T_0 = T_{ν/2} < T_1 = T_{ν/2+1} < ... < T_{ν/2-1} = T_{ν-1}

and (q_0, q_1, ..., q_{ν-1}) = (0, ν/2, 1, ν/2+1, ..., ν/2-1, ν-1).

For the bitonic sorting paradigm (E_0, E_1, E_0, ..., E_{ν-1}, ..., E_0) dimension E_q is used μ(q) = ν - q
times, so that μ(0) > μ(1) > ... > μ(ν-1) and (p_0, p_1, ..., p_{ν-1}) = (0, 1, ..., ν-1). From Eq. 6.6, the
permutation σ that minimizes the emulation time is

    σ = (σ(0), ..., σ(ν-1)) = (0, ν/2, 1, ν/2+1, ..., ν/2-1, ν-1).

For  the  emulation  time,  Eq  6  yields 

    T_σ = Σ_{h=0}^{ν-1} μ(p_h) T_{q_h}

        = Σ_{h=0}^{ν/2-1} ( μ(p_{2h}) T_{q_{2h}} + μ(p_{2h+1}) T_{q_{2h+1}} )

        = Σ_{h=0}^{ν/2-1} ( (ν-2h) T_h + (ν-2h-1) T_{ν/2+h} )

        = Σ_{h=0}^{ν/2-1} (2ν-4h-1)(2^{h+1}-2)

        = O(2^{ν/2}) = O(√N).
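
The pairing of the most frequently used dimensions with the fastest ones can be sketched as follows. This is only an illustration under stated assumptions: the mesh execution times are those of the column-major numbering (2^{j+1} - 2 for E_j and E_{ν/2+j}), the convention sigma[p] = q means that paradigm dimension E_p is run by the emulator's procedure for E_q, and all function names are hypothetical.

```python
def mesh_times(nu):
    """Transfer steps T_j on a column-major (sqrt(n) x sqrt(n)) mesh, nu even."""
    half = nu // 2
    return [2 ** (j % half + 1) - 2 for j in range(nu)]


def best_assignment(freq, times):
    """Pair the h-th most frequent dimension p_h with the h-th fastest one q_h."""
    by_freq = sorted(range(len(freq)), key=lambda p: -freq[p])   # p_0, p_1, ...
    by_time = sorted(range(len(times)), key=lambda q: times[q])  # q_0, q_1, ...
    sigma = [0] * len(freq)
    for p, q in zip(by_freq, by_time):
        sigma[p] = q
    return sigma


def emulation_time(freq, times, sigma):
    """Total transfer steps when dimension E_p costs times[sigma[p]] per use."""
    return sum(freq[p] * times[sigma[p]] for p in range(len(freq)))


if __name__ == "__main__":
    nu = 10
    freq = [nu - q for q in range(nu)]   # bitonic sorting uses E_q (nu - q) times
    times = mesh_times(nu)
    sigma = best_assignment(freq, times)
    print("sigma       :", sigma)
    print("column-major:", emulation_time(freq, times, list(range(nu))))
    print("optimal     :", emulation_time(freq, times, sigma))
```

For ν = 4 the sketch returns σ = (0, 2, 1, 3), and the total number of transfer steps drops from 8 (column-major numbering) to 6. In general the optimal total stays below 14·2^{ν/2}, consistent with the O(√N) estimate.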

The permutation σ and the numbering f_σ are illustrated in Figure 6.4, for ν = 4. In general, if i
and j respectively have binary representations i_{ν/2-1} i_{ν/2-2} ... i_1 i_0 and j_{ν/2-1} j_{ν/2-2} ... j_1 j_0, then
h = f(i,j) = 2^{ν/2} j + i has the binary representation j_{ν/2-1} ... j_0 i_{ν/2-1} ... i_0 and f_σ(i,j) = π(h) has the
binary representation j_{ν/2-1} i_{ν/2-1} j_{ν/2-2} i_{ν/2-2} ... j_0 i_0. □
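
The two numberings of the mesh can be computed with a few bit operations. The sketch below assumes the reading given above, in which f is the column-major numbering and f_σ interleaves the bits of j and i (j_h in position 2h+1, i_h in position 2h); the function names are hypothetical.

```python
def column_major(i, j, nu):
    """f(i, j) = 2^(nu/2) * j + i."""
    return (j << (nu // 2)) + i


def interleaved(i, j, nu):
    """f_sigma(i, j): bits of j and i interleaved as ... j_1 i_1 j_0 i_0."""
    h = 0
    for b in range(nu // 2):
        h |= ((i >> b) & 1) << (2 * b)        # i_b goes to position 2b
        h |= ((j >> b) & 1) << (2 * b + 1)    # j_b goes to position 2b + 1
    return h


if __name__ == "__main__":
    # Print both numberings for the (4 x 4) mesh, row index i, column index j.
    nu, side = 4, 4
    for i in range(side):
        print([column_major(i, j, nu) for j in range(side)],
              [interleaved(i, j, nu) for j in range(side)])
```

With the interleaved numbering, the two lowest cube dimensions connect processors at distance one in the mesh, the next two connect processors at distance two, and so on, which is the alternate assignment of dimensions to columns and rows.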

The I/O Format. Usually the format of the input array A[0], ..., A[N-1] and of the output array
A[0], ..., A[N-1] are imposed a priori on our emulator G by global system considerations, and must be
consistent with the I/O format of other parts of the system that are interfaced with G. As a
consequence, at the end of the input phase, the data may be loaded into the processors of G in an order that
differs from the one required by the emulation algorithm. A similar situation might occur for the
output data.

Figure 6.4. Column-major numbering f, and optimal numbering f_σ for the bitonic sorting
paradigm on a (4×4) mesh.

In such a situation the emulation procedure must be preceded and followed by a suitable
permutation of the data. This problem can be solved by resorting to the Benes permutation algorithm.

Originally formulated for a network of switches [Be64], Benes' algorithm can also be cast in a
cube paradigm consisting of a Descend followed by an Ascend [PV81a]. The schedule of use of the
dimensions is then

    (E_{ν-1}, E_{ν-2}, ..., E_1, E_0, E_1, ..., E_{ν-2}, E_{ν-1}).

At each operation step, a pair of processors connected along the active dimension may or may not
exchange their data, according to the value of a control bit. Each permutation of size 2^ν can be realized
by this algorithm, by a suitable choice of the control bits.

Although the parallel computation of the control bits for an arbitrary permutation is not a simple
task, in the application we are considering the permutations of data to be realized are known at design
time. Thus, the control bits can be precomputed and stored in the processors, provided that each
processor is endowed with O(ν) = O(log N) storage.


In conclusion, by adding to the computation time the usually negligible overhead corresponding to
a constant number of Ascend and Descend algorithms, any (a priori known) I/O format can be combined
with any correspondence of processors between the cube and the emulator.

The Benes permutation algorithm could be exploited to build emulation procedures more
sophisticated than the one described in this section. In fact, for a paradigm in which the frequency of use of
the dimensions is strongly time dependent, it may be convenient to dynamically change the allocation
of the dimensions during the execution of the algorithm. This approach, however, will not be pursued
further in this thesis (being inapplicable to the sorting problem).

6.2.5  The  Cube-Connected-Cycles. 

We  have  seen  that  the  mesh-connected  bitonic  sorter  is  area-time  optimal  for  slow  computation. 
To  obtain  faster  sorters  we  turn  our  attention  to  the  CCC  which  we  already  know  to  be  optimal  for  the 
Descend  paradigm  in  a  wide  range  of  computation  times. 

As we have seen in Section 5.3.1, in an (s×t)-CCC (s = 2^σ, t = 2^τ, ts = n = 2^ν) the cube
dimensions are naturally divided into two groups: the cycle dimensions E_0, ..., E_{σ-1} and the
lateral dimensions E_σ, E_{σ+1}, ..., E_{σ+τ-1} (σ+τ = ν). A cycle dimension E_j (0 ≤ j ≤ σ-1) is
executed in T_j = 2^{j+1} - 2 transfer steps. A lateral dimension E_{σ+j} (0 ≤ j ≤ τ-1) is executed by
pipelining the data around the complete cycle, and therefore uses O(s) transfer steps. However,
when all the lateral dimensions have to be executed consecutively, O(s) transfer steps are sufficient
(rather than O(τs)) because the pipelined mode of operation allows the execution of different
dimensions to be overlapped.

In a paradigm like bitonic sorting, where in several merging phases only some of the lateral
dimensions are executed, the emulation procedure becomes inefficient. In fact, each of the last
τ = O(log n) merging phases M_σ, ..., M_{ν-1} requires the use of a set of consecutive lateral dimensions
(more specifically, E_σ, ..., E_{σ+i} are used during phase M_{σ+i}), and therefore takes O(s) transfer steps.
Thus, the CCC executes the sorting paradigm in O(τs) = O(s log n) transfer steps.

We recall from the discussion of Section 6.2.1 that to attain the lower bound for sorting we need
to keep the number of transfer steps in the bitonic sorting paradigm of the same order as in the Descend
paradigm, which, for the CCC, is O(s).

At first, the problem might seem similar to the one already encountered for the mesh, where we
have reduced the number of transfer steps from O(√n log n) to O(√n) by a more appropriate
allocation of the dimensions. However, for the CCC we face a more difficult situation because the dimensions
already obey the principle that the most frequently used ones are those with the least execution time.
The problem is that most of the dimensions have a high execution time.

Thus, a fast and efficient execution of the bitonic sorting paradigm requires the development of
new networks. In the next two sections we shall describe the pleated-cube-connected-cycles, and the
mesh of cube-connected-cycles, and show that they are area-time optimal emulators of the bitonic
sorting paradigm.


6.2.6  The  Pleated-Cube-Connected-Cycles  (PCCC)


Description. There is a basic observation that, when recursively applied, leads to the modification of
the CCC into the PCCC, and to a performance gain. The informal argument goes as follows: for any
given integer β, the highest β dimensions E_{ν-1}, ..., E_{ν-β} are used only during the last β merging
phases. We could then deploy 2^β "small" CCCs, each with n/2^β processors, to execute the first ν-β
phases of merging in parallel, and subsequently supply the intermediate results to a "large" CCC, with
n processors, to complete the execution of the sorting algorithm. The advantage of this strategy is that
the smaller machines have short cycles, and work faster, while a large CCC would have to use its full
cycle length in all stages of the algorithm. The transfer of results from the small CCCs to the large
CCC can actually be accomplished with no data movement by simply reconfiguring the network.
Indeed, the reconfiguration of the 2^β small CCCs to the large one can be realized by suitably embedding
the former into the latter. We first lay out the 2^β (s/2 × t/2^{β-1})-CCCs in a 2 × 2^{β-1} array (see
Figure 6.5). Note (again refer to Figure 6.5) that each cycle of a CCC in the lower tier faces at its upper
end the lower end of a cycle in an upper tier CCC. Next we modify the layout by merging each pair of
facing cycles into a single cycle with s processors and by using the last available β-1 rows of the top tier
CCCs to realize the lateral connections for E_{ν-β+1}, ..., E_{ν-1} (see Figure 6.5). We obtain an s×t array,
where in each column we have to provide a suitable switch to reconfigure the two original length-s/2
cycles into a single length-s cycle.

In the machine we have just described, the highest β-1 dimensions are lateral, but the β-th one is
a cycle dimension. The next τ - β + 1 dimensions are again lateral and the remaining (σ - 1) ones
are cycle dimensions. For the first ν - β merging phases of the sorting paradigm, the 2^β CCCs are
decoupled and work in parallel. Consider now phase M_{ν-α} (1 ≤ α ≤ β), which corresponds to the
execution of the sequence E_{ν-α}, ..., E_0. The (lateral) dimensions E_{ν-α}, ..., E_{ν-β+1} and the cycle
dimension E_{ν-β} are executed using the full cycle length; next, the cycles are reconfigured to half
length, and E_{ν-β-1}, ..., E_0 are executed in the "small" CCCs.

Before proceeding further, we consider the permissible values of the parameters β, s, and t. Since
the top s/2 rows of the full network must support τ lateral dimensions (i.e., in each cycle there must
be at least one processor per dimension), we have:

    τ ≤ s/2.                                                    (6.7)

Since (β - 1), the number of lateral dimensions connecting the "small" CCCs in Figure 6.5, must satisfy
1 ≤ (β - 1) ≤ τ, we trivially have:

    2 ≤ β ≤ τ + 1.                                              (6.8)


We shall now complete the modification of the CCC by fully exploiting the key idea which led
to the network of Figure 6.5. This is achieved by defining as an (s×t)-pleated CCC (PCCC) the
network of Figure 6.5 where each of the 2^β component networks is itself a recursively defined
(s/2 × t/2^{β-1})-PCCC (rather than a conventional CCC). Note that β is a design parameter.

Figure 6.5. Construction of the PCCC. (a) Arrangement of 2^β independent small CCCs; (b) their
interconnection to form the CCC.

An (s×t)-PCCC with given β can be viewed as an s×t array of processors P_{ij} (as before,
s = 2^σ, t = 2^τ; 0 ≤ i < s, 0 ≤ j < t) whose columns are organized as reconfigurable cycles, and
whose rows support lateral connections. Starting from the highest dimension, lateral and cycle
dimensions are interleaved, (β-1) to one. This interleaving continues until we reach cycle dimension
E_{ν-λβ}, where the cycle length, 2^{σ-λ}, is still adequate to accommodate the τ lateral dimensions. Thus,
from the condition 2^{σ-λ} ≥ τ we obtain

    λ ≤ σ - log_2 τ                                             (6.9)

and λ is called the depth of interleaving. At this point the remaining ν - λβ dimensions are assigned
as follows: the higher τ - (β - 1)λ ≥ 0 are lateral, and the lower σ - λ are cycle dimensions.
Obviously we have

    β ≤ τ/λ + 1,                                                (6.10)

and the cycles are reconfigurable to any length 2^γ where σ - λ ≤ γ ≤ σ. Note that there are
2^λ - 1 reconfiguring switches per column.
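
The assignment of the cube dimensions in the PCCC can be written out explicitly. The sketch below assumes the reading of conditions 6.9 and 6.10 given here (2^{σ-λ} ≥ τ and τ - (β-1)λ ≥ 0); the function name and the list-of-labels representation are hypothetical.

```python
def pccc_dimensions(sigma, tau, beta, lam):
    """Label the dimensions E_{nu-1}, ..., E_0 (highest first) of an (s x t)-PCCC.

    s = 2**sigma, t = 2**tau, beta is the design parameter and lam the depth
    of interleaving. Returns a list of 'lateral' / 'cycle' labels.
    """
    if beta < 2 or lam < 0:
        raise ValueError("need beta >= 2 and lam >= 0")
    if 2 ** (sigma - lam) < tau:
        raise ValueError("condition 6.9 violated: cycles too short")
    if tau - (beta - 1) * lam < 0:
        raise ValueError("condition 6.10 violated: too few lateral dimensions")
    labels = []
    for _ in range(lam):                       # interleaved part, (beta-1) to one
        labels += ["lateral"] * (beta - 1) + ["cycle"]
    labels += ["lateral"] * (tau - (beta - 1) * lam)
    labels += ["cycle"] * (sigma - lam)
    return labels


if __name__ == "__main__":
    # The (8 x 16)-PCCC with beta = 3 and lam = 1.
    labels = pccc_dimensions(sigma=3, tau=4, beta=3, lam=1)
    nu = len(labels)
    for offset, label in enumerate(labels):
        print(f"E_{nu - 1 - offset}: {label}")
```

For the (8×16) example with β = 3 and λ = 1 the sketch yields two lateral dimensions, one cycle dimension, two more lateral dimensions and two cycle dimensions, for a total of τ = 4 lateral and σ = 3 cycle dimensions. Asking for λ = 2 with the same s and t is rejected, since 2^{σ-λ} = 2 < τ.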

An (8×16)-PCCC with β = 3 is illustrated in Figure 6.6. Notice that conditions 6.7 and 6.8 are
automatically satisfied. Moreover, from 6.9 the maximum permissible value of λ is 1, whence condition
6.10 is comfortably satisfied. Incidentally, note that, for the same value β = 3, the smallest PCCC with
λ = 2 has 256 processors (t = 16, s = 16, β = 3).

Due to the above interleaving of dimensions, processor P_{ij} in the PCCC-array corresponds to
cube-processor P_{h'}, where h = j 2^σ + i and h' is the integer obtained by permuting the binary
representation of h according to the above interleaving scheme. This is illustrated in Figure 6.7.

Figure 6.7. Bit position permutation induced by the pleating scheme (the arrangements of h'
and h are above and below, respectively).

Performance Analysis. In this section we give an upper bound to the area of the PCCC and to
the time used to execute bitonic sorting. The PCCC is to be laid out in the rectangular grid. We will
assume that a PCCC processor is endowed with a serial comparator and O(k) bits of storage so that it fits
in an O(1) × O(k) area. We also assume that edges have unit width. Data transmission takes place in
serial fashion. Then the width of the (s×t)-PCCC is easily seen to be O(t), if we lay out each cycle in
a constant number of vertical tracks. It is easy to see that an array row associated with the α-th
highest lateral dimension (α = 1, ..., τ) (no matter what is the index of the dimension in the cube) is
laid out with t/2^α tracks. When we consider the multiplicity of each dimension, i.e. the number of
rows associated with it as a result of "pleating", we obtain the following formula for the height of the
PCCC:


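    height = t [ (1/2 + ... + 1/2^{β-1})
                 + 2 (1/2^β + ... + 1/2^{2(β-1)})
                 + ...
                 + 2^{λ-1} (1/2^{(λ-1)(β-1)+1} + ... + 1/2^{λ(β-1)})
                 + 2^λ (1/2^{λ(β-1)+1} + ... + 1/2^τ) ] + O(ks).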


By evaluating this sum, for β = 2 we have height = O(λt + ks), while for β > 2 we have
height = O(t + ks). Since in the case β > 2 the height does not depend upon λ, hereafter we
further restrict β to be ≥ 3. Moreover, we add the condition s ≤ √(n/k), so that height = O(t). We
then conclude that A = O(width × height) = O(t^2).

The analysis of the computation time requires some additional discussion. A cycle dimension
associated with arrays of length l uses (l-1) steps, by the technique explained in Section 5.3.1. The
execution of a set of consecutive lateral dimensions on a cycle of length l requires no more than 4l steps (2l
comparison-exchanges and 2l shifts of the cycles). For our convenience we consider first the highest λβ
dimensions, and we group them in λ sets of size β, which are executed β, 2β, ..., λβ times respectively.
Since the execution of each set requires ≤ 4l steps, with l = s/2^i for the i-th group (i = 0, ..., λ-1),
we can upper bound the total number of steps τ_1 for these dimensions as follows:

    τ_1 ≤ Σ_{i=0}^{λ-1} (i+1) β (4s/2^i) = 4sβ Σ_{i=0}^{λ-1} (i+1)/2^i ≤ 16 s β.


The remaining dimensions are handled by a set of (s/2^λ × t/2^{(β-1)λ})-CCCs, which have a cycle
length O(log n) (since s/2^λ = 2^{⌈log_2 τ⌉} = O(τ) = O(log n)). Thus, referring to the well-known
behavior of the conventional CCC, we know that the entire set of the last ν-λβ dimensions can be
executed in O(log n) steps. On the other hand, the entire set is executed O(log n) times, thus they
globally require τ_2 = O(log^2 n) steps. Finally, recalling that k is the operand length, the total
computation time T is given by k(τ_1 + τ_2) = O(k(s + log^2 n)), and, for s = Ω(log^2 n), we have T = O(ks).
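
The bound on the number of steps spent on the interleaved dimensions can be checked numerically. The sketch below assumes the grouping described above (the i-th group is executed (i+1)β times, each execution costing at most 4s/2^i steps); the function name is hypothetical.

```python
def tau1_bound(s, beta, lam):
    """Upper bound on the steps spent on the highest lam * beta dimensions.

    Group i (i = 0, ..., lam - 1) is executed (i + 1) * beta times and each
    execution takes at most 4 * s / 2**i steps.
    """
    return sum((i + 1) * beta * 4 * s / 2 ** i for i in range(lam))


if __name__ == "__main__":
    s, beta = 1024, 3
    for lam in range(1, 8):
        print(lam, tau1_bound(s, beta, lam), 16 * s * beta)
```

Because Σ_{i≥0} (i+1)/2^i = 4, the sum never exceeds 16sβ, whatever the depth of interleaving λ. The contribution of the interleaved dimensions is therefore O(s) for constant β.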

We  can  summarize  the  preceding  discussion  as  follows. 

Theorem 6.2. The pleated CCC can sort keys of length k in time T = O(ks) and area
A = O(n^2/s^2) for any s in the range [Ω(log^2 n), O(√(n/k))]. For k = log n + Θ(log n), the performance
is AT^2 = O(n^2 log^2 n) for T ∈ [Ω(log^3 n), O(√(n log n))].

The PCCC was first proposed in [BP84a], where a detailed description of the control structure
is also given.


6.2.7  The  Mesh-of-CCC 


Another network that can execute the bitonic sorting paradigm with the same area-time
performance as the PCCC is the mesh-of-CCC (MCCC), a suitable "hybridization" of mesh and CCC.

An (N,m)-MCCC, with N = 2^ν, m = 2^μ, and t ≜ N/m^2 = 2^τ (τ = ν - 2μ) consists of m^2 CCC
modules, each with t cycles of length r. The Nr processors of the MCCC are conveniently indexed as

    P^{ij}_{pq} : 0 ≤ i,j < m, 0 ≤ p < r, 0 ≤ q < t.            (6.11)

For a fixed pair (i,j) the set {P^{ij}_{pq} : 0 ≤ p < r, 0 ≤ q < t} is connected as an (r×t)-CCC, and, for a fixed
q, the set of processors {P^{ij}_{0q} : 0 ≤ i,j < m} is mesh connected (with i and j as row and column
indices, respectively).

The MCCC graph can be laid out in a square of area A = O(N^2/m^2), since each CCC requires
O(N^2/m^4) area, and channels of width O(N/m^2) allow a straightforward implementation of mesh
connections.

We discuss now the properties of the MCCC as an emulator of the binary cube. To avoid possible
confusion, let us immediately say that the CCC modules will not be deployed in the standard mode
described in Section 5.3.1. In fact an (r×t)-CCC will be used to process only t (rather than r×t) data
items. This is accomplished by storing the data in row(0) (i.e., processors P^{ij}_{0q} : 0 ≤ q < t, for fixed i
and j) and by sending the data through the cycle for execution of the dimensions. Only the τ lateral
dimensions of the CCC are then used, and therefore the length of the cycle r does not need to be a
power of two. An (N,m)-MCCC will emulate a ν-dimensional binary cube whose processors are
P(0), P(1), ..., P(N-1). We establish the following correspondence between MCCC processors and
cube processors:

    P^{ij}_{0q} ↔ P_h,    h = j N/m + i N/m^2 + q.              (6.12)

It is easy to see that dimensions $E_0, \ldots, E_{r-1}$ are assigned to the CCC modules, dimensions $E_r, \ldots, E_{r+\mu-1}$ are assigned to the mesh columns, and finally dimensions $E_{r+\mu}, \ldots, E_{\nu-1}$ are assigned to the mesh rows (here $m = 2^\mu$ and $N = 2^\nu$). Applying the by now familiar techniques for emulating the cube with a CCC or a linear array, an Ascend (or Descend) algorithm can be executed in $O(r+m)$ word steps. On operands of length $k$, with bit-serial transmissions and operations, the computation time is $T = O((r+m)k)$. For $m$ in the range $[\Omega(\log N), O(\sqrt{N/\log N})]$, considering that $r = O(\log N)$, we obtain $T = O(mk)$.
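The correspondence (6.12) and the resulting assignment of cube dimensions can be checked with a small sketch. The names `cube_index` and `dimension_owner` are illustrative and do not appear in the original text; the code assumes that $N$ and $m$ are powers of two and that each CCC module holds $t = N/m^2 = 2^r$ data items.

```python
def cube_index(i, j, q, N, m):
    """Cube processor index h for MCCC position (i, j, q), following (6.12)."""
    t = N // (m * m)                      # data items held by each CCC module
    assert 0 <= i < m and 0 <= j < m and 0 <= q < t
    return j * (N // m) + i * t + q


def dimension_owner(d, N, m):
    """Part of the MCCC that executes cube dimension E_d under (6.12)."""
    r = (N // (m * m)).bit_length() - 1   # r = log2(t)
    mu = m.bit_length() - 1               # mu = log2(m)
    if d < r:
        return "ccc"
    # Bits r .. r+mu-1 of h encode the row index i, so the two partners
    # differ only in their row and therefore lie in the same mesh column.
    return "mesh column" if d < r + mu else "mesh row"
```

For example, with $N = 64$ and $m = 2$ the map is a bijection onto $\{0, \ldots, 63\}$, dimensions $E_0, \ldots, E_3$ stay inside a module, $E_4$ uses a mesh column, and $E_5$ uses a mesh row.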

In conclusion, for Ascend and Descend, the MCCC achieves $AT^2 = O(N^2k^2)$ for $T \in [\Omega(k\log N), O(k\sqrt{N/\log N})]$, which is optimal. (A variant of this result has been proved by [A83] for a network similar to the MCCC.)

It would not be difficult to see that the MCCC, in the form just described, does not achieve optimal performance when executing the bitonic sorting paradigm. However, it is also easy to realize that the problem lies in the assignment of the topmost $2\mu$ dimensions. As we have already seen for the ordinary mesh, the best strategy consists in an alternate assignment of these dimensions to columns and rows. Formally, if

$i = \sum_{l=0}^{\mu-1} i_l 2^l, \qquad j = \sum_{l=0}^{\mu-1} j_l 2^l, \qquad h' = \sum_{l=0}^{\mu-1} (2i_l + j_l)\, 2^{2l},$

we then establish between the processors of the MCCC and those of the cube the correspondence:

$P^{ij}_{0q} \leftrightarrow P(h), \qquad h = h'N/m^2 + q.$    (6.13)
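The bit interleaving behind (6.13) can be sketched as follows. The function names are illustrative only, and the code assumes that $m = 2^\mu$ and $N$ are powers of two, with $i_l$ and $j_l$ denoting the binary digits of $i$ and $j$.

```python
def interleave(i, j, mu):
    """h' = sum over l of (2*i_l + j_l) * 4**l for mu-bit indices i and j."""
    h_prime = 0
    for l in range(mu):
        i_l = (i >> l) & 1
        j_l = (j >> l) & 1
        h_prime += (2 * i_l + j_l) << (2 * l)   # j_l at bit 2l, i_l at bit 2l+1
    return h_prime


def cube_index_interleaved(i, j, q, N, m):
    """Cube processor index h for MCCC position (i, j, q), following (6.13)."""
    mu = m.bit_length() - 1
    return interleave(i, j, mu) * (N // (m * m)) + q
```

Changing only the lowest bit of $j$ flips bit $r$ of $h$, while changing only the lowest bit of $i$ flips bit $r+1$, which is the alternation between rows and columns that the text describes.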

With this correspondence, dimensions $E_0, \ldots, E_{r-1}$ of the binary cube are assigned to the CCC modules, dimensions $E_r, E_{r+2}, \ldots$ are assigned to the mesh rows, and dimensions $E_{r+1}, E_{r+3}, \ldots$ are assigned to the mesh columns. When executing the bitonic sorting paradigm, $O(r\log N)$ word steps are used by the CCCs. In fact there are $\nu = \log N$ merging phases, and each of them involves no more than $r$ CCC dimensions. As for the mesh dimensions, they are used exactly in the same way as in a bitonic sorting algorithm on the mesh, and therefore their execution takes $O(m)$ word steps. Globally, $O(m + r\log N)$ word steps are needed. Since $r = O(\log N)$, for operands of length $k$ and for $m \in [\Omega(\log^2 N), O(\sqrt{N/k})]$ we obtain $T = O(mk)$. Recalling that $A = O(N^2/m^2)$ we have proved the following theorem. (The number of keys $n$ equals the parameter $N$ of the MCCC.)

Theorem 6.3. The mesh-of-CCC can sort $n$ keys of length $k$ in time $T = O(km)$ and area $A = O(n^2/m^2)$, for any $m$ in the range $[\Omega(\log^2 n), O(\sqrt{n/k})]$. For $k = \log n + \Theta(\log n)$, the performance is $AT^2 = O(n^2\log^2 n)$, for $T \in [\Omega(\log^3 n), O(\sqrt{n\log n})]$.

We have already obtained three optimal sorters implementing the bitonic algorithm and respectively based on the mesh, the pleated CCC, and the mesh-of-CCC. It is indeed possible to construct several other optimal emulators of the binary cube, by suitable combinations of known emulators (for example the shuffle-exchange can be used to define the mesh-of-shuffles, or the shuffle-of-meshes, or the shuffle-connected-cycles). Although a systematic classification of the emulators of the cube is a problem interesting in its own right, it would not shed further light on bitonic sorting, and therefore we do not pursue it here.

However, there is one aspect of the PCCC and of the MCCC which is not satisfactory, namely that they do not achieve computation times smaller than $\Omega(\log^3 n)$, while the only obvious lower bound is $\Omega(\log^2 n)$, since there are $(\log n + 1)\log n/2$ consecutive steps in the bitonic sorting paradigm.
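The step count $(\log n + 1)\log n/2$ can be made concrete with a plain sequential sketch of the bitonic sorting paradigm. This is the textbook formulation on an array of $n = 2^\nu$ keys, not a model of the PCCC or MCCC hardware; `bitonic_sort_schedule` and `bitonic_sort` are names chosen here for illustration.

```python
def bitonic_sort_schedule(nu):
    """(phase, dimension) pairs executed by bitonic sort on 2**nu keys."""
    return [(phase, d)
            for phase in range(1, nu + 1)        # merging phases
            for d in range(phase - 1, -1, -1)]   # dimensions E_{phase-1}, ..., E_0


def bitonic_sort(keys):
    """Sort a list whose length is a power of two by compare-exchange steps."""
    a = list(keys)
    n = len(a)
    nu = n.bit_length() - 1
    for phase, d in bitonic_sort_schedule(nu):
        for x in range(n):
            y = x ^ (1 << d)                     # partner along dimension E_d
            if x < y:
                ascending = ((x >> phase) & 1) == 0
                if (a[x] > a[y]) == ascending:
                    a[x], a[y] = a[y], a[x]
    return a
```

The schedule for $\nu = 4$ has $4 \cdot 5/2 = 10$ entries, one per consecutive compare-exchange step.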

The discrepancy is obviously due to the fact that a bit-serial mode is adopted both for transmission and comparison-exchange operations. In fact, we can speed up the execution of bitonic sorting by resorting to parallel comparison-exchange, but, for $k = \Theta(\log n)$, each comparison step requires $\Omega(\log k) = \Omega(\log\log n)$ time, so that the global sorting time is still $\Omega(\log^2 n \log\log n)$.

To circumvent this difficulty we need to apply the pipeline principle not only to the words of a given sequence, but also to the bits of a given word. Prior to modifying the MCCC according to this idea, we discuss a mode of operation of the CCC network, which we call the bit-pipeline mode, in contrast with the standard mode, which we call the word-pipeline mode.

For concreteness, we shall illustrate the bit-pipeline mode in the case of the bitonic merge algorithm, which is indeed a Descend algorithm where the operation steps consist in comparison-exchanges.

To sort a bitonic sequence of size $n = 2^\nu$ we deploy a $(\nu \times 2^\nu)$-CCC with $\nu 2^\nu$ processors $P_{ij}$ ($0 \le i < \nu$, $0 \le j < n$). All processors in row(0) (i.e., $P_{0j}$, $0 \le j < n$) are also equipped with a shift register capable of storing $k$-bit operands. All the edges are realized with unit bandwidth, and data transmission is serial.


A bitonic vector ([Ba68]) $A[0], \ldots, A[n-1]$ is initially input with component $A[j]$ loaded in $P_{0j}$. Then, at each dimension $E_{\nu-1}, \ldots, E_1, E_0$, pairs of elements are compared and, if necessary, exchanged to place the smaller of the two in the cycle with the smaller number. More specifically, each processor reads the inputs starting from the most significant bit and compares them. As long as the two inputs agree, they are transmitted to the next processor in the same cycle. As soon as a discrepancy is detected, a switch is set and, from then on, the remaining substrings of each operand follow a fixed path, independently of their value.
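The behaviour of a single processor in bit-pipeline mode can be sketched as a bit-serial comparison-exchange. This is a software illustration under the assumption that both operands arrive as equal-length bit lists, most significant bit first; `to_bits` and `serial_compare_exchange` are hypothetical helper names.

```python
def to_bits(x, k):
    """k-bit representation of x, most significant bit first."""
    return [(x >> (k - 1 - t)) & 1 for t in range(k)]


def serial_compare_exchange(a_bits, b_bits):
    """Route two MSB-first bit streams to (smaller, larger), one bit per time unit."""
    switch = None                 # unset until the first disagreement is seen
    low, high = [], []
    for a, b in zip(a_bits, b_bits):
        if switch is None and a != b:
            # The operand showing a 1 at the first differing position is larger.
            switch = "cross" if a > b else "straight"
        if switch == "cross":
            low.append(b)
            high.append(a)
        else:                     # still equal so far, or already set straight
            low.append(a)
            high.append(b)
    return low, high
```

Each output bit depends only on input bits that have already arrived, so a chain of $\nu$ such processors delays an operand by $O(\nu)$ time units rather than $O(k\nu)$, which is the source of the $O(k+\nu)$ bound stated next.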

For operands of $k$ bits, the algorithm takes $O(k+\nu)$ units of time, in contrast with the $O(k\nu)$ units of time used in word-pipeline mode. If $k = O(\log n)$, as for the keys considered in this chapter, then $T = O(\log n)$.

Let us now consider an $(n,m)$-MCCC for the execution of the bitonic sorting paradigm, where the CCC modules function in the bit-pipeline mode. The only difference in performance is that the first $r$ dimensions ($r = \log n - 2\log m$) use $O(r+k)$ steps each time that a group of them is executed, namely for each merging phase. Thus, since $r = O(\log n)$, $k = O(\log n)$, and the merging phases are $\nu = \log n$, the CCC dimensions $E_0, \ldots, E_{r-1}$ globally take $O(\log^2 n)$ time. Nothing is changed for the mesh dimensions $E_r, \ldots, E_{\nu-1}$, which take $O(m\log n)$ time, so that, for the entire algorithm, $T = O(\log^2 n + m\log n)$.

By considering $m$ in the range $[\Omega(\log n), O(\sqrt{n/\log n})]$ we have then proved the following theorem.

Theorem 6.4. The mesh-of-CCC can sort $n$ keys of length $k = \log n + \Theta(\log n)$ with $AT^2 = O(n^2\log^2 n)$ for $T \in [\Omega(\log^2 n), O(\sqrt{n\log n})]$.
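As a check, the bound of Theorem 6.4 follows by multiplying out the area and time expressions already derived; the lines below only restate those quantities, assuming $m = \Omega(\log n)$ so that the mesh term dominates the running time.

```latex
\[
A = O\!\left(\frac{n^2}{m^2}\right), \qquad
T = O(\log^2 n + m\log n) = O(m\log n) \quad \text{for } m = \Omega(\log n),
\]
\[
AT^2 = O\!\left(\frac{n^2}{m^2}\cdot m^2\log^2 n\right) = O(n^2\log^2 n).
\]
\[
m = \log n \;\Rightarrow\; T = O(\log^2 n), \qquad
m = \sqrt{n/\log n} \;\Rightarrow\; T = O\!\left(\sqrt{n\log n}\right).
\]
```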

We have now exhausted the potential of bitonic sorting. To obtain faster sorters we have to consider other algorithms.


6.3 NETWORKS FOR MERGE-ENUMERATION SORTING

In this section we concentrate on "very fast" VLSI sorters. The main objective is to design sorters with minimum running time $T = \Theta(\log n)$. To achieve area-time optimality, these sorters must have area $A = \Theta(n^2)$.

The first $\Theta(\log n)$ time VLSI sorter was proposed by [L81] and [NMB83], and it is based on the Muller-Preparata algorithm executed by the orthogonal-tree (OT) network. We briefly review it in the sequel.

6.3.1 The Orthogonal-Tree Sorter

To sort $n$ keys $X_0, \ldots, X_{n-1}$ of length $k$, let us consider an OT network as the one described in Section 5.3.2. The Muller-Preparata algorithm (see Section 5.3.3) is executed as follows.

1. Key $X_i$ is input at the root of row tree $RT_i$, and broadcast to the leaf processors $P_{i0}, P_{i1}, \ldots, P_{i,n-1}$ ($i = 0,1,\ldots,n-1$).

2. Processor $P_{jj}$ sends its content $X_j$ to the root of column tree $CT_j$, which in turn broadcasts it to the leaf processors ($j = 0,1,\ldots,n-1$).

3. Processor $P_{ij}$, which we assume to be equipped with $O(k)$ bits of memory and with a serial comparator, compares $X_i$ and $X_j$, and produces a bit $C_{ij}$. $C_{ij}$ is one if $X_i > X_j$, or if $X_i = X_j$ and $i > j$, and $C_{ij}$ is zero otherwise ($i,j = 0,1,\ldots,n-1$).

4. The internal nodes of row tree $RT_i$, which we assume to be equipped with a serial adder with a one-bit delay feedback on the carry, compute the sum $C_i = \sum_{j=0}^{n-1} C_{ij}$. The sum is produced at the root and will then be broadcast to all the leaves. Obviously $C_i$ is the rank of $X_i$ in the sorted output ($i = 0,1,\ldots,n-1$).

5. Processor $P_{ij}$ compares $C_i$ with $j$. If $C_i \ne j$, $P_{ij}$ remains idle. If $C_i = j$, $P_{ij}$ sends its content $X_i$ to the root of tree $CT_j$.

6. The root of $CT_j$ can now output the number received from the leaf, which will be $Y_j$ (as usual $Y_0, Y_1, \ldots, Y_{n-1}$ is the sorted sequence corresponding to multiset $X_0, X_1, \ldots, X_{n-1}$).

Both operations and data transmission are done bit-serially, so that all the edges of the OT can be realized with unit bandwidth. It is easy to see that each step of the algorithm takes at most $O(k + \log n)$ time. Thus, for $k = \log n + \Theta(\log n)$, the global running time is $T = \Theta(\log n)$.
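The six steps of the orthogonal-tree sorter amount to sorting by enumeration, which can be sketched sequentially as follows. The sketch ignores the tree network and the bit-serial timing; it only reproduces the comparison bits $C_{ij}$, the ranks $C_i$, and the final placement. The name `enumeration_sort` is chosen here for illustration.

```python
def enumeration_sort(keys):
    """Return (sorted keys, ranks) using the comparison rule of step 3."""
    n = len(keys)
    # Step 3: C[i][j] = 1 if X_i > X_j, or if X_i == X_j and i > j.
    c = [[1 if (keys[i] > keys[j] or (keys[i] == keys[j] and i > j)) else 0
          for j in range(n)]
         for i in range(n)]
    # Step 4: the row sums are the ranks C_i.
    ranks = [sum(row) for row in c]
    # Steps 5 and 6: X_i leaves through column C_i.
    out = [None] * n
    for i, rank in enumerate(ranks):
        out[rank] = keys[i]
    return out, ranks
```

The tie-breaking rule on indices makes the ranks distinct even when keys repeat, so every output position receives exactly one key.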

The OT is area-time suboptimal because its area is $A = O(n^2\log^2 n)$. However, it is an interesting network, and it is also useful as a building block of optimal networks, as we shall see in the following sections.

6.3.2 A Network for the Merge-Enumeration Combiner

We now turn our attention to the class of merge-enumeration combine-sort algorithms (Section 5.3.3). Since the original description of these algorithms [P78] is related to the shared-memory machine, we need to investigate possible implementations with finite degree networks.

We begin by proposing a parallel network for the fundamental block of the sorter, namely the $(m,l)$-combiner, where we will assume that $m = 2^\mu$ and $l = 2^\lambda$ are powers of two. This network will accept as input $m$ sorted sequences of $l$ elements each,

$S_i = (s_i(0), s_i(1), \ldots, s_i(l-1)), \qquad i = 0,1,\ldots,m-1,$

and produce as output a single sorted sequence $S$, which is the combination of $S_0, \ldots, S_{m-1}$, and has $L = ml = 2^{\lambda+\mu}$ elements

$S = (s(0), s(1), \ldots, s(L-1)).$

The $(m,l)$-combiner will execute the algorithm based on pairwise merging as outlined in the preceding section. Its organization is illustrated in Figure 6.8. It consists of $m^2$ modules (each capable of merging two sequences of length $l$ and of computing partial ranks), laid out as a square $m \times m$ mesh and indexed as $M_{ij}$ ($i,j = 0,1,\ldots,m-1$). The modules of each row are interconnected as the leaves of a binary tree of bandwidth $l$; so are the modules of each column. Thus, the combiner has the structure of the orthogonal-trees machine, whose leaves are merging modules. The interconnecting trees have the following functions:

(1) to "broadcast" a sequence to all units in which it must be merged with some other sequence;

(2) to compute global ranks from partial ranks;

(3) to rearrange the elements according to their ranks into the sorted sequence $S$.


Figure 6.8. Overview of the $(m,l)$-combiner, for $m = 4$.


We  will  now  describe  in  some  detail  the  merging  modules  and  the  interconnecting  trees. 

Merging modules. Merging module $M_{ij}$ will merge sequences $S_i$ and $S_j$ and compute $C_{ij}(h)$, for $h = 0, \ldots, l-1$. We recall that $C_{ij}(h)$ is the number of elements of $S_j$ that are less than (respectively, less than or equal to) $s_i(h)$ when $i < j$ (respectively, when $i > j$). Each module is realized as a $(\lambda+1 \times 2^{\lambda+1})$-CCC (see Figure 6.9). We shall refer to the processors of module $M_{ij}$ as micromodules and we shall index them as $P^{ij}_{pq}$, with $0 \le p < \lambda+1$ and $0 \le q < 2^{\lambda+1}$.


Figure 6.9. Merging unit $M_{ij}$ realized by a $(3, 2^3)$-CCC, used to merge two sequences with four elements each. (Labels in the figure: RT-lines, first sorted sequence; CT-lines, second sorted sequence.)


The layout area of a merging module is of order $O(l^2)$ (Section 5.3.1).

Interconnecting trees. As indicated earlier, the merging modules are interconnected by two families of $L = ml$ complete binary trees with $m = 2^\mu$ leaves and bandwidth 1. We will refer to these families as the row trees and column trees.

The lines of the row trees and the column trees are respectively labelled $RT_i(h)$ and $CT_i(h)$, $i = 0, \ldots, m-1$; $h = 0, \ldots, l-1$. The trees and the merging modules are connected through a small interface, whose structure will be fully specified in connection with the description of the combination algorithm in the next section. At this point we just say that the leaves of $RT_i(h)$ are, from left to right, connected to the CCC micromodules $P^{i0}_{0h}, P^{i1}_{0h}, \ldots, P^{i,m-1}_{0h}$; the leaves of $CT_j(h)$ are connected to the CCC micromodules $P^{0j}_{0,l+h}, P^{1j}_{0,l+h}, \ldots, P^{m-1,j}_{0,l+h}$; in other words, the row trees and the column trees are respectively connected to the RT and CT lines of the merging modules. The connection between each leaf of a tree and the corresponding CCC micromodule is realized through a buffer register of the appropriate size (adequate to store one element to be sorted). The situation is illustrated in Figure 6.10.


Figure 6.10. Interconnection of modules and trees.

6.3.3 The Combination Algorithm

We now describe how the merge-enumeration algorithm can be executed by the network introduced in the preceding section. For convenience we split the algorithm into several phases.

(A)  Input  of  Data  and  Broadcasting  to  Merging  Modules 

Element $s_i(h)$ is input at the root of tree $RT_i(h)$, and is then broadcast to all leaves of the tree. At this point, the left half of row(0) in module $M_{ij}$ contains the sequence $S_i$. To fill the right halves of row(0) of all modules, we proceed as follows. First, in each "diagonal" module $M_{ii}$ the sequence $S_i$ is copied in the second half of row(0). (This can be done by using the connection of row($\lambda$) between the left and the right half of the machine.) Next, from micromodule $P^{jj}_{0,l+h}$, which is a leaf of $CT_j(h)$, element $s_j(h)$ is broadcast (through the root) to all other leaves of the same tree. At this point, the merging module $M_{ij}$ contains $S_i$ and $S_j$ in row(0) and merging can begin.


(B)  Merging  and  Partial  Rank  Computation 

Merging can be executed by resorting to the bitonic algorithm, and using the CCC modules in a bit-pipeline mode, as explained in Section 6.2.7. However, in order to execute bitonic merging, we first need to reverse the order of $S_j$. This is accomplished by an Ascend algorithm in which columns $l$ to $2l-1$ of each $M_{ij}$ exchange their data at dimensions $E_0, \ldots, E_{\lambda-1}$ while columns 0 to $l-1$ remain idle. All the columns are idle at dimension $E_\lambda$.

Now the data are ready and bitonic merging can be executed. At the end of merging, the result resides in row(0) of the CCC, and the element in $P^{ij}_{0h}$, $0 \le h \le 2l-1$, has rank $h$ in merge$(S_i, S_j)$. Now we want to transmit the ranks of $s_i(0), \ldots, s_i(l-1)$ to processors $P^{ij}_{00}, \ldots, P^{ij}_{0,l-1}$, respectively. This is accomplished by retracing backwards the path traversed by each element $s_i(h)$, and is easily done if each $P^{ij}_{pq}$ keeps track of whether it exchanged or not the operands during the merging process. So, all we have to do is to run the machine backwards, with an Ascend algorithm, which applies to the ranks the inverse of the permutation that merged the elements. At the end of this phase, processor $P^{ij}_{0h}$, $0 \le h \le l-1$, stores the number of elements in merge$(S_i, S_j)$ that are less than $s_i(h)$. If from this number we subtract $h$ we obtain $C_{ij}(h)$, the number of elements of $S_j$ which are less than $s_i(h)$. We call the $C_{ij}$'s partial ranks because from them we can compute the rank of each $s_i(h)$ in the sorted sequence $S$ as $C_i(h) = \sum_{j=0}^{m-1} C_{ij}(h)$.
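The role of the partial ranks can be illustrated with a sequential sketch in which binary search stands in for the merging modules. The tie-breaking rule follows the definition of $C_{ij}(h)$ given for the merging modules; taking $C_{ii}(h) = h$ for the diagonal module is an assumption of this sketch, and the function names are illustrative.

```python
from bisect import bisect_left, bisect_right


def partial_rank(seqs, i, j, h):
    """C_ij(h): contribution of sequence S_j to the rank of s_i(h)."""
    x = seqs[i][h]
    if i == j:
        return h                           # assumption: own sequence contributes h
    if i < j:
        return bisect_left(seqs[j], x)     # elements of S_j strictly less than x
    return bisect_right(seqs[j], x)        # elements of S_j less than or equal to x


def combine(seqs):
    """Combine m sorted sequences of equal length l using total ranks."""
    m, l = len(seqs), len(seqs[0])
    out = [None] * (m * l)
    for i in range(m):
        for h in range(l):
            rank = sum(partial_rank(seqs, i, j, h) for j in range(m))
            out[rank] = seqs[i][h]
    return out
```

Because equal keys are ordered by sequence index and then by position, the total ranks form a permutation of $0, \ldots, L-1$ even in the presence of repeated keys.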

(C)  Total  Rank  Computation 

It is immediate to see that at the end of phase B the partial ranks $C_{i0}(h), C_{i1}(h), \ldots, C_{i,m-1}(h)$ of $s_i(h)$ are available exactly at the leaves of row tree $RT_i(h)$. By having in each internal node of the tree a full adder with 1-bit delay feedback on the carry, we can then obtain at the root of $RT_i(h)$ the sum $C_i(h)$ of the values stored at the leaves. The nodes work as serial adders and the tree is used in a pipelined fashion, so that the time required is $O(\mu+\lambda)$, where $\mu = \log m$ is the depth of the tree, and $\lambda+1$ is the wordlength of the operands (note that $C_{ij}(h) \le 2^\lambda$). Within the same order of time, we can subsequently broadcast $C_i(h)$ from the root to the leaves. (Indeed $C_i(h) < 2^{\lambda+\mu}$, so it can be expressed by $\lambda + \mu$ bits.)
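The serial adders in the row trees can be sketched as follows. The sketch assumes that operands are streamed least significant bit first (the text does not fix the bit order for this phase), that the number of leaves is a power of two, and that `width` is large enough to hold the final sum; all names are illustrative.

```python
def to_lsb_bits(x, width):
    """Bits of x, least significant first."""
    return [(x >> t) & 1 for t in range(width)]


def from_lsb_bits(bits):
    return sum(b << t for t, b in enumerate(bits))


def serial_add(a_bits, b_bits):
    """Full adder whose carry output is fed back with a one-clock delay."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)      # sum bit emitted in this clock
        carry = total >> 1         # carry stored for the next clock
    return out


def tree_sum(values, width):
    """Sum the leaf values with a binary tree of serial adders."""
    level = [to_lsb_bits(v, width) for v in values]
    while len(level) > 1:
        level = [serial_add(level[p], level[p + 1])
                 for p in range(0, len(level), 2)]
    return from_lsb_bits(level[0])
```

With $m = 4$ leaves and $l = 4$, the total rank is below $L = 16$, so a width of $\lambda + \mu = 4$ bits suffices, in line with the remark that closes phase (C).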

(D)  Sorting  Permutation  and  Output  of  Data 

We want to output the elements $s(0), \ldots, s(L-1)$ of the sorted sequence from the roots of the column trees, and, specifically, we want the root of $CT_j(h)$ to output element $s(j2^\lambda + h)$. This corresponds to a natural right-to-left order of the column trees as they appear in the layout of Figure 6.10.

Considering a generic element $s_i(p)$ with rank $C_i(p)$, the binary spellings of the integers $j$ and $h$ so that $s_i(p)$ will emerge from the root of column tree $CT_j(h)$ are readily obtained by taking the $\mu$ most significant bits and the $\lambda$ least significant bits of the rank $C_i(p)$ to represent $j$ and $h$, respectively. Thus, as a first step, we "activate" in $M_{ij}$ the elements of sequence $S_i$ that have to emerge from trees $CT_j$'s, and "inhibit" all other elements. The active elements are those whose rank $C_i(p)$ has the $\mu$ most significant bits agreeing with the column number $j$ of the merging module. Next, we rearrange the active elements in $M_{ij}$ so that $s_i(p)$ is sent to $P^{ij}_{0h}$, with $h = C_i(p) \bmod l$.

This operation is essentially a permutation of the active (and non-active) elements, and can be done by using the CCC as an emulator of the Benes network [Be64]. The setting of the switches, although nontrivial, is greatly simplified with respect to the general case by the fact that the active elements do not change their relative order. The desired rearrangement can be done by using the idea of concentration introduced in [NS82], and expansion, which could be viewed as the inverse of concentration. If $t$ elements are active in the given module, they are first sent to the $t$ leftmost columns of the CCC (concentration), and then routed to the destination columns (expansion). A straightforward adaptation of the algorithm that is proposed in [NS82] for concentration in the cube-machine shows that an Ascend and a Descend phase is all that is required to rearrange data on our CCC. Some bits required to set the switches must be precomputed. This task could be performed by the CCC or (to keep the micromodule structure as simple as possible) it can be assigned to a binary tree of full adders whose leaves would be contained in the interface between the CCC and the row-trees.

During the entire rearrangement task, computation takes place only in the left half of the CCC without using dimension $E_\lambda$. We then transfer each active element from $P^{ij}_{0h}$ to $P^{ij}_{0,l+h}$ with a straightforward use of dimension $E_\lambda$.

At this point element $s(j2^\lambda + h)$ is in $P^{ij}_{0,l+h}$ (where the value of $i$ is determined by the input sequence to which $s(j2^\lambda + h)$ originally belongs), and is ready to be transmitted to the root of $CT_j(h)$, where it is output.
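The way a total rank selects the output tree in phase (D) can be sketched as follows, under the convention stated above that the root of $CT_j(h)$ outputs $s(j2^\lambda + h)$. The helper names are hypothetical, and the routing through the CCC (concentration and expansion) is not modelled.

```python
def output_port(rank, lam):
    """Split a total rank into (j, h) such that rank = j * 2**lam + h."""
    return rank >> lam, rank & ((1 << lam) - 1)


def active_elements(ranks_i, j, lam):
    """Pairs (p, h): elements s_i(p) active in module M_ij and their target column h."""
    result = []
    for p, rank in enumerate(ranks_i):
        col, h = output_port(rank, lam)
        if col == j:                 # top bits of the rank agree with column j
            result.append((p, h))
    return result
```

Since the ranks of the elements of a sorted sequence $S_i$ are increasing, the target columns of the active elements are increasing as well, which is why the active elements keep their relative order during the rearrangement.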

Performance Analysis and Modification of the Network. Since both the CCCs and the interconnecting trees work in pipeline in bit-serial mode, any operation takes time proportional to the sum of the operand length and the pipe depth. For the CCC the depth is $\lambda + 1$ and the operand length is either $k$ (input words) or $\lambda + 1$ (partial ranks). Since a constant number of Ascend and Descend algorithms are executed, we conclude that $O(\lambda + k)$ total time is spent in the CCCs. For the trees the depth is $\mu + 1$, and the operand length is either $k$ (input words) or $\lambda + \mu$ (total ranks). Since a constant number of fan-in and fan-out algorithms are executed, we conclude that $O(\lambda + \mu + k)$ total time is spent in the trees. Thus, the time spent in the interconnecting trees dominates that spent in the CCCs. Recalling that a full binary tree on $m$ aligned leaves is laid out in height $\Theta(\log m)$ and that each row (column) of modules is served by $l$ row (column) trees, we conclude:

Lemma 6.1. A full-tree $(2^\mu, 2^\lambda)$-combiner of keys of length $k$ can be laid out in a square of width $O(\mu 2^{\lambda+\mu})$ and operates in time $T = O(\lambda + \mu + k)$.

We now observe that when $k = \Omega(2^\mu)$, then $T = O(\lambda + k)$. In this case the time performance of the trees is insignificantly degraded if we realize them as comb-trees, rather than as full binary trees. The depth increases from $\mu$ to $2^\mu$ (which is tolerable in time since $2^\mu = O(k)$), but the layout area decreases by a factor of $O(\mu^2)$. We conclude:

Lemma 6.2. A comb-tree $(2^\mu, 2^\lambda)$-combiner of keys of length $k = \Omega(2^\mu)$ can be laid out in a square of width $O(2^{\lambda+\mu})$ and operates in time $T = O(\lambda + k)$.

Summary of Symbols for an $(m,l)$-Combiner

Sizes: $m = 2^\mu$, $l = 2^\lambda$, $L = ml$, $k$ = key length.

Input sequences:
$S_i = (s_i(0), s_i(1), \ldots, s_i(l-1))$, $i = 0,1,\ldots,m-1$.

Output sequence:
$S = (s(0), s(1), \ldots, s(L-1))$.

Merging modules: $(\lambda+1, 2^{\lambda+1})$-CCCs
$M_{ij}$ : $i,j = 0,1,\ldots,m-1$
$P^{ij}_{pq}$ : $0 \le p < \lambda+1$, $0 \le q < 2^{\lambda+1}$, micromodules of $M_{ij}$.

Row-trees and column-trees:
$RT_i(h)$, $CT_j(h)$ : $0 \le i,j \le m-1$, $0 \le h \le l-1$.


6.3.4 The Sorter


The combiner can be used to construct a general network for combination-sort. As an intermediate step in the construction, we introduce a new operation called coalescence. Given a collection of $n$ elements, partitioned into $n/l_{i-1}$ sorted subsequences each containing $l_{i-1}$ elements, and given a multiple $l_i$ of $l_{i-1}$, which is also a divisor of $n$, we call $(n; l_{i-1}, l_i)$-coalescence the operation of combining (in the sense defined earlier) consecutive blocks of $m_i = l_i/l_{i-1}$ sequences.

If we refer to the tree of Figure 5.1, we can easily see that each level of the tree corresponds to a coalescence of the input sequence. If we call coalescer a network that performs a coalescence, we can build a combination-sorter by cascading a suitable set of coalescers, as shown in Figure 6.11.
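The cascade of coalescences can be sketched sequentially as follows. Here `heapq.merge` from the Python standard library stands in for the combiner network, so the sketch shows only the data movement of an $(n; l_{i-1}, l_i)$-coalescence, not its VLSI realization; the function names and the `lengths` parameter are illustrative.

```python
from heapq import merge


def coalesce(keys, l_prev, l_next):
    """Combine consecutive groups of l_next // l_prev sorted runs of length l_prev."""
    n = len(keys)
    assert l_next % l_prev == 0 and n % l_next == 0
    out = []
    for start in range(0, n, l_next):
        runs = [keys[s:s + l_prev]
                for s in range(start, start + l_next, l_prev)]
        out.extend(merge(*runs))      # stand-in for one (m_i, l_{i-1})-combiner
    return out


def combination_sort(keys, lengths):
    """Cascade of coalescers; lengths = [l_1, l_2, ..., n] with l_0 = 1."""
    l_prev = 1
    for l_next in lengths:
        keys = coalesce(keys, l_prev, l_next)
        l_prev = l_next
    return keys
```

For instance, `combination_sort(keys, [2, 8])` on eight keys first forms sorted pairs and then combines the four pairs in a single stage.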

The coalescer. An $(n; l_{i-1}, l_i)$-coalescer can be easily constructed by using $n_i = n/l_i$ $(m_i, l_{i-1})$-combiners. Let us assume for simplicity that $n_i$ is a perfect square. We can then lay out the combiners in a $\sqrt{n_i} \times \sqrt{n_i}$ array with input and output lines running in a chosen direction, say, parallel to the rows.

To estimate the area of the coalescer, we first assume to use full-tree combiners, so that the side of the combiner has a length of $O(l_i \log m_i)$ (parallel to the rows). Using Lemma 6.1 we have

$\text{Height} = O(\sqrt{n_i}\, l_i \log m_i + n_i l_i) = O\!\left(n\left(1 + \frac{\log m_i}{\sqrt{n_i}}\right)\right),$

$\text{Width} = O(\sqrt{n_i}\, l_i \log m_i) = O\!\left(\frac{n\log m_i}{\sqrt{n_i}}\right).$

An example with $n_i = 4$ is shown in Figure 6.12. The computation time is readily found as $T_i = O(\lambda + k + \log m_i)$. We conclude:


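Lemma 6.3. An $(n; l_{i-1}, l_i)$ full-tree coalescer can be laid out in a rectangle $O\!\left(n\left(1 + \frac{\log m_i}{\sqrt{n_i}}\right)\right) \times O\!\left(\frac{n\log m_i}{\sqrt{n_i}}\right)$ and operates in time $T_i = O(\lambda + k + \log m_i)$ ($k$ is the input key length, $n_i = n/l_i$, $m_i = l_i/l_{i-1}$, and $\lambda = \log l_{i-1}$). When $k = \log n + \Theta(\log n)$, then $T_i = O(\log n)$.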


Figure 6.11. Combination-sorter as a cascade of coalescers.

... (width $\times$ height) $O(n) \times O(n)$, $O(n\log\log n/\log n) \times O(n)$, and $O(n) \times O(n)$, respectively. It is also clear that the total time is $O(\log n)$. So we have:

Theorem 6.5. There is a VLSI merge-enumeration sorter of $n$ keys of length $k = \log n + \Theta(\log n)$ with area $A = O(n^2)$, and computation time $T = O(\log n)$.

Remark. The first coalescer stage of the sorter we have just described consists of $\log^2 n$ sorters, each processing a sequence of $n/\log^2 n$ keys. These sorters are essentially orthogonal-tree sorters of the type described in Section 6.3.1. Strictly speaking, for $l = 1$, the $(m,l)$-combiner of Section 6.3.2 consists of two families of binary trees $(RT_0(0), \ldots, RT_{m-1}(0))$ and $(CT_0(0), \ldots, CT_{m-1}(0))$ such that the $j$-th leaf of $RT_i(0)$ and the $i$-th leaf of $CT_j(0)$ are connected (they indeed form the merging module $M_{ij}$), whereas in the OT network they would be identified.


Figure 6.13. An optimal VLSI merge-enumeration sorter with three coalescers.


6.3.5 Sorting in Time $T \in [\Omega(\log n), O(\log^2 n)]$


We have seen that $AT^2 = \Theta(n^2\log^2 n)$ can be achieved for $T = \Theta(\log n)$ (Theorem 6.5) and for $T \in [\Omega(\log^2 n), O(\sqrt{n\log n})]$ (Theorems 6.1, 6.2, 6.4). It is natural to try to extend the result to the interval $T \in [\Omega(\log n), O(\log^2 n)]$. For this purpose we start from the following observation. A combine-sorter with $n/s$ inputs can sort (in time $O(s\log n)$ and area $O(n^2/s^2)$) $s = 2^\sigma$ sequences of $n/s$ elements each. These sequences can then be fed, say one per column, into an $(n,s)$-MCCC. At this point, the sequence in each CCC module is already sorted, and the MCCC is ready (after inverting the order of some sequences to comply with bitonic sorting rules) to execute the last $2\sigma$ merging phases. (For the sake of simplicity we will ignore the fact that only $\sigma$ phases would be really necessary after the work done by the combination-sorter.) A simple analysis allows us to conclude that, in the process, the MCCC executes $O(\log s + s)$ steps using $O(\log n)$ time for each, thus running for a total time $T = O(s\log n)$. We can then state:

Theorem 6.6. There is a VLSI sorter of $n$ keys of length $k = \log n + \Theta(\log n)$ with optimal $AT^2 = \Theta(n^2\log^2 n)$ for any computation time $T \in [\Omega(\log n), O(\sqrt{n\log n})]$.

With Theorem 6.6, the characterization of the area-time complexity of the $(n, \log n + \Theta(\log n))$-sorting problem is complete (within multiplicative constant factors).


6.4  OTHER  OPTIMAL  NETWORKS 

For completeness, we report here two other interesting results concerning optimal $(n, \log n + \Theta(\log n))$-sorters.

The first result is due to Leighton [L84], and provides a design that achieves optimal $AT^2 = \Theta(n^2\log^2 n)$ for $T \in [\Omega(\log n), O(\sqrt{n\log n})]$. The network consists of a suitably interconnected family of OT-networks. The algorithm is a combine-sort of the hybrid type, with the first stage of combinations performed with the Muller-Preparata algorithm, and the remaining stages (one or two depending on $T$) performed with the multiway-shuffle algorithm. We refer the reader to [L84] for more details. However, we shall return to Leighton's algorithm in Section 7.2, where we study circuits to sort keys of medium length.

The second result is due to Bilardi and Preparata [BP84c] who have shown that the AKS network [AKS83] can be optimally laid out in area $A = O(n^2)$, while maintaining a sorting time $T = O(\log n)$ on keys of $O(\log n)$ bits.

The details are rather intricate, and hence are not repeated here. An open question, as far as we know, is the existence of optimal networks that execute the AKS sorting algorithm in time greater than $\Omega(\log n)$. Obviously, the answer to this question would not improve the characterization of the area-time complexity of sorting, since we already have several optimal constructions, but could shed some light on the algorithm itself.


CHAPTER  7 


SORTING  KEYS  OF  ARBITRARY  LENGTH 

7.1  INTRODUCTION 

In Chapter 6 we have studied in depth the $(n,k)$-sorting problem for the special, but important, case when $k = \log n + \Theta(\log n)$. In this chapter we consider the general problem of sorting keys of arbitrary length.

The classification of keys into short ($k \le \log n$), medium-length ($\log n < k < 2\log n$), and long ($2\log n \le k$), introduced in Chapter 4 in the context of lower-bound arguments, maintains its validity when considering circuit constructions. Indeed, a different algorithm and a different network are appropriate to each of the above three intervals of key lengths.

The  difference  between  the  VLSI  model  and  other  models  of  parallel  computation  reveals  its  full 
extent  in  the  present  chapter,  where  an  attempt  to  optimize  the  area-time  performance  of  VLSI  sorters 
leads  to  the  formulation  of  novel  sorting  algorithms. 

For short and medium-length keys the efficiency of the new algorithms is based on the use of the appropriate encoding schemes for the multisets being processed. For long keys the efficiency of the algorithms rests instead on the adoption of non word-local I/O protocols that induce a partition of the chip into regions within which primary flow is confined, and among which only secondary flow is exchanged.

All the algorithms we shall consider in this chapter make use, at some stage, of an
$(n, \log n + \Theta(\log n))$-sorting procedure. Thus, the constructions confirm that the key length
$k = \log n + \Theta(\log n)$ plays a special role, as a careful analysis of lower-bound arguments had already
indicated in Section 6.1.


Sorting algorithms and networks for medium-length, short, and long keys are respectively
discussed in the next three sections of the chapter.

7.2  SORTERS  FOR  KEYS  OF  MEDIUM  LENGTH 

In this section we derive upper bounds for the $(n, \log n + h)$-sorting problem for $0 < h < \log n$. We
recall from Theorems 4.14 and 4.15 that

$AT^2 = \Omega(n^2 h^2)$    7.1

and, for boundary chips,

$AT^2 = \Omega(n^2 h \log n)$.    7.2

When $h = \Theta(\log n)$, lower bounds 7.1 and 7.2 are of the same order, and they are both achieved
by the constructions of Chapter 6. However, a careful analysis of the upper bounds reveals that they
are of the form $AT^2 = \Theta(n^2 (\log n + h)^2)$, so that even if $h$ is zero we have $AT^2 = \Theta(n^2 \log^2 n)$. Thus,
for $h = o(\log n)$, the sorters of Chapter 6 are slightly suboptimal.

In the following, we shall see that the performance of these sorters can indeed be improved by
exploiting the fact that a multiset of $n$ keys of length $k = \log n + h$ can be encoded with $2n(h+1)$ bits,
as has been shown in Section 2.2.4.

In the design of our sorter for keys of medium length, we shall use an approach very frequently
adopted in the design of VLSI networks, which can be formulated as follows. Let $\Pi$ be a problem
amenable to a divide-and-conquer solution, and let us assume that we are trying to solve $\Pi$ with target
performance $AT^2 = O(n^2)$ on input instances of size $n$. Additionally, let us suppose that a design for
$\Pi$ is known with performance $A_0 T_0^2 = O(g(n)\, n^2)$, where $g(n)$, a monotone increasing function of
$n$, is the gap between the performance of the known design and the target. Then, if we decompose the
problem into $g(n)$ subproblems of size $n/g(n)$, we can solve the subproblems with $g(n)$ networks of
performance $(A_0(n/g(n)),\, T_0(n/g(n)))$, globally achieving

$A_1 T_1^2 = g(n)\, O(g(n/g(n))\, n^2 / g^2(n)) = O(n^2)$.

Thus, to obtain the desired result, we are left with the problem of combining the solutions to the $g(n)$
subproblems in area and time of the same order as $A_1$ and $T_1$ respectively.
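The cancellation in this estimate rests on the monotonicity of $g$, which gives $g(n/g(n)) \le g(n)$. As a worked step (taking the constants hidden in the $O$-notation to be 1, an assumption made only for readability, and using the fact that the $g(n)$ subnetworks work side by side, so that areas add while the time is unchanged):

```latex
A_1 T_1^2 \;=\; g(n)\, A_0\!\left(\tfrac{n}{g(n)}\right) T_0^2\!\left(\tfrac{n}{g(n)}\right)
\;\le\; g(n)\cdot g\!\left(\tfrac{n}{g(n)}\right)\left(\tfrac{n}{g(n)}\right)^{2}
\;\le\; g(n)\cdot g(n)\cdot \frac{n^{2}}{g^{2}(n)} \;=\; n^{2}.
```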

This approach effectively transforms the task from the design of the entire system to the design of
a subsystem for the combination step of the divide-and-conquer strategy. According to intuition, the
better the construction that we use for the subproblems, i.e. the smaller the gap $g(n)$, the smaller
the number of subproblems that have to be combined, and therefore the easier the combination step.

For the sorting problem we are presently considering, we already know several designs achieving
$A_0 T_0^2 = O(n^2 \log^2 n)$, and we can try to follow the above approach. For concreteness, we refer to the
boundary chip case, so that our target is a design with performance $AT^2 = O(n^2 h \log n)$, and
$g(n) = \log n / h$. Thus, we can sort $\log n / h$ sequences of $nh/\log n$ elements each within an area-time
performance allowed by our objective, and we are then left with the problem of combining these
sequences.

For this combination we shall use Leighton's multiway-shuffle algorithm, for reasons that will become
apparent as the description of the sorter unfolds and that, at this point, we can informally explain as
follows.

To attain the $AT^2 = \Omega(n^2 h \log n)$ lower bound we cannot afford to maintain the list
representation of the input multiset throughout the entire algorithm. Indeed, this would imply an $\Omega(n \log n)$
information exchange across a suitable bisection of the network, whereas we can only afford an $O(nh)$
information exchange. Thus, it is essential to compactly encode the multiset, or some part thereof, in the
stages of the algorithm that pose the heaviest demand in terms of global rearrangement of data.

We shall indeed use the insert-and-prune encoding scheme to solve this problem. On the other
hand, when a multiset is compactly encoded, the individual elements are not easily accessible for
operations such as comparison-exchanges; therefore it is very desirable to be able to use the compact form
only for data transmission, and to recover the natural list representation wherever operations are to be
executed.


The multiway-shuffle combination is ideally suited to our purposes, because the only global
rearrangement of data occurs when shuffling and unshuffling the keys, while the other stages of the
algorithm require data interactions only within the blocks of a suitable partition of the input multiset.

There is a difficulty, however. The insert-and-prune encoding is itself based on sorting a sequence
of length twice as large as the one being encoded. Thus, using the sorters of Chapter 6, we cannot encode
the entire input multiset at once without exceeding our target performance. Hence, encoding will be
applied to suitable subsets of keys. However, this entails a loss in the efficiency of the encoding. (The
reader will easily convince himself that the optimal encoding of $S_1 \cup S_2$ requires fewer bits than the
sum of the number of bits required to encode $S_1$ and $S_2$ separately.) This difficulty will be at least
partially overcome by resorting to a recursive technique.

We have presented the main ideas involved in the design of the sorter of medium-length keys,
and we are ready to give a detailed description of the construction.

The ideas we have informally presented above will be combined according to the following
scheme, illustrated in Figure 7.1, consisting of three basic steps:

1.  Given a sorter design, we show how to construct an encoder/decoder of multisets based on the
insert-and-prune  method.  The  area-time  performance  of  the  encoder/decoder  will  be  a  function 
of  the  performance  of  the  sorter  used  in  the  construction. 

2.  Given designs of an encoder/decoder and of a multiway shuffler/unshuffler, we show how to
construct another shuffler/unshuffler whose performance is better than that of the original
shuffler/unshuffler.

3.  Given designs of a shuffler/unshuffler and of a sorter, we show how to construct a new sorter
(with  improved  performance)  by  resorting  to  multiway-shuffle  combination. 

The scheme will be iteratively applied. The first stage of the iteration will use a sorter of
performance $AT^2 = O(n^2 \log^2 n)$, and a straightforward implementation of shuffler and unshuffler with the
same performance. Subsequent stages will use as a starting point the designs for the sorter and for the
shuffler/unshuffler obtained in the previous stage.


[Figure 7.1: block diagram with blocks labeled Sorter, Shuffler/Unshuffler, Insert-and-Prune
Encoder/Decoder, Multiway Shuffle, New Sorter, and New Shuffler/Unshuffler.]

Figure 7.1.  Basic steps in the construction of sorters for keys of medium length.


In Sections 7.2.1 and 7.2.2 we shall describe in detail the basic steps of Figure 7.1. Indeed one of
them, which is the multiway-shuffle combination sort, has already been discussed in Section 5.2.4.

The  ideas  we  have  introduced  could  be  applied  to  obtain  boundary-chip  sorters  as  well  as  non- 
boundary-chip  sorters,  although  the  construction  of  the  latter  is  somewhat  more  involved.  For  the  sake 
of  simplicity,  we  shall  develop  the  boundary-chip  case,  and  we  shall  adopt  the  following  conventions. 
All  our  circuits  for  sorting,  encoding/decoding,  and  shuffling/unshuffling  will  be  laid  out  in  a  region  of 
rectangular  shape,  with  the  input  ports  on  the  north  side  and  the  output  ports  on  the  south  side  of  the 
rectangle.  The  width  of  the  rectangle  will  then  be  proportional  to  the  number  of  I/O  bits  divided  by 
computation time. The height will instead depend on the bisection flow that we are able to achieve,
and will be our objective limitation (to be reduced). Thus, for a complete $(n, \log n + h)$-sorter, our target
is a width $O(n \log n / T)$ and a height $O(nh/T)$.

7.2.1  Insert-and-Prune Encoder and Decoder

We recall from Section 2.2.4 that the insert-and-prune encoding of a multiset $\{X_0, \ldots, X_{n-1}\}$ of
words of length $\log n + h$ is obtained by sorting the multiset

$\{X_0, \ldots, X_{n-1},\; 0 \cdot 2^h,\; 1 \cdot 2^h,\; \ldots,\; (n-1) \cdot 2^h\}$

and pruning the $\log n - 1$ most significant bits of each word in the resulting sequence.

Thus, an insert-and-prune encoder can be easily realized by a simple modification of any of the
sorters described in Chapter 6. Indeed, it is sufficient to consider a $(2n, \log n + h)$-sorter such that $n$ of the
input keys are prestored and have the fixed values $0 \cdot 2^h, 1 \cdot 2^h, \ldots, (n-1) \cdot 2^h$. The performance of such an
encoder is then $AT^2 = O(n^2 \log^2 n)$ for $T \in [\Omega(\log n),\, O(\sqrt{n \log n})]$.

The decoder is slightly more complex. Let the insert-and-prune encoding of $\{X_0, \ldots, X_{n-1}\}$
consist of the sequence of $2n$ words $(W_0, W_1, \ldots, W_{2n-1})$ of $h+1$ bits each, with $W_i = W_i^h W_i^{h-1} \cdots W_i^0$.
The following algorithm enables us to obtain the X's from the W's.

1.  For $i = 0, 1, \ldots, 2n-1$, compute the value of the binary variable $b_i$ defined as

    $b_i = 0$ if either ($i = 0$) or ($i > 0$ and $W_i^h = W_{i-1}^h$),

    $b_i = 1$ if ($i > 0$ and $W_i^h \ne W_{i-1}^h$).

2.  Compute the cumulative sum of the sequence $b_i$, defined as $B_i = \sum_{j=0}^{i} b_j$, where
    $B_i = B_i^{\log n - 1} \cdots B_i^1 B_i^0$ is a word of $\log n$ bits.

3.  Form a new list $W'_0, \ldots, W'_{2n-1}$, where $W'_i = b_i\, B_i^{\log n - 1} \cdots B_i^1\, W_i^h \cdots W_i^0$.

4.  Sort the $W'_i$, and prune the most significant bit of each key. The first $n$ keys of the resulting
    sequence are $(Y_0, \ldots, Y_{n-1}) = \mathrm{sort}(X_0, \ldots, X_{n-1})$, and they form the sorted list representation
    of the multiset encoded by $(W_0, \ldots, W_{2n-1})$.


Step 4 poses the heaviest demand of area-time resources. Thus, the insert-and-prune decoder can also be
realized with $AT^2 = O(n^2 \log^2 n)$, for $T \in [\Omega(\log n),\, O(\sqrt{n \log n})]$.
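The encoding and the four decoding steps can be mimicked in software. The following is a minimal sequential sketch, not a model of the VLSI circuit: the function names are hypothetical, keys are plain integers of $\log n + h$ bits with $n$ a power of two, and Python's built-in sort stands in for the sorter. One assumption goes beyond the text: position 0 is flagged together with the positions where $b_i = 1$, so that exactly the first word of each block (the inserted value) is discarded.

```python
from itertools import accumulate


def insert_and_prune_encode(xs, n, h):
    """Encode n keys of (log2 n + h) bits as 2n words of h + 1 bits."""
    assert len(xs) == n and all(0 <= x < n << h for x in xs)
    inserted = [j << h for j in range(n)]      # 0*2^h, 1*2^h, ..., (n-1)*2^h
    merged = sorted(list(xs) + inserted)       # sort the 2n-key multiset
    mask = (1 << (h + 1)) - 1                  # prune all but the h+1 low bits
    return [w & mask for w in merged]


def insert_and_prune_decode(ws, n, h):
    """Recover the sorted keys from the 2n pruned words."""
    assert len(ws) == 2 * n
    top = [(w >> h) & 1 for w in ws]           # bit W_i^h of each word
    # Step 1: b_i marks a change of the top bit (a new block starts).
    b = [0] + [int(top[i] != top[i - 1]) for i in range(1, 2 * n)]
    # Step 2: cumulative sum B_i is the block index of word i.
    big_b = list(accumulate(b))
    # Step 3: prepend a flag and the high bits of B_i to each word.
    tagged = []
    for i, w in enumerate(ws):
        flag = 1 if (i == 0 or b[i] == 1) else 0   # assumption: i = 0 flagged too
        key = ((big_b[i] >> 1) << (h + 1)) | w     # B_i^{log n - 1}..B_i^1 W_i^h..W_i^0
        tagged.append((flag, key))
    # Step 4: sort, drop the flag, keep the first n keys.
    tagged.sort()
    return [key for _, key in tagged[:n]]
```

For instance, with $n = 8$ and $h = 2$ the eight 5-bit keys are carried by sixteen 3-bit words, and decoding returns them in sorted order.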

In  general,  we  can  use  the  construction  outlined  above  to  obtain  an  encoder  or  a  decoder  from  any 
given  sorter.  It  is  also  convenient  for  our  applications  to  combine  the  encoder  and  the  decoder  into  one 
block,  whose  performance  is  stated  in  the  following  lemma. 

Lemma 7.1. Given a design for an $(n,k)$-sorter with computation time $T_S(n,k)$ and height $H_S(n,k)$,
we can construct an encoder/decoder with time and height respectively given by

$T_{ED}(n,k) \le \xi\, T_S(n,k)$    7.3

and

$H_{ED}(n,k) \le \eta\, H_S(n,k)$    7.4

where $\xi$ and $\eta$ are suitable constants (independent of $n$ and $k$), greater than one.

7.2.2  Reducing  the  Bandwidth  for  Shuffling  and  Unshuffling 

The multiway-shuffle combination (Section 5.2.4) is so denoted because two of the steps of the
algorithm respectively consist of a $p$-unshuffle and of a $p$-shuffle of $ml$ elements (where $p$ divides $l$).

Indeed,  these  two  steps  are  the  only  ones  that  require  a  global  rearrangement  of  the  input  keys, 
and  therefore  pose  the  heaviest  demand  of  bandwidth.  Thus,  it  is  crucial  to  be  able  to  perform  the 
shuffling  and  the  unshuffling  very  efficiently. 

In general, both the multiway shuffle and the multiway unshuffle of $N$ words of $K$ bits can each
be executed by a circuit that works in time $T_{SU}$ and has width $O(NK/T_{SU})$ and height
$H_{SU} \le \sigma NK / T_{SU} = O(NK/T_{SU})$, where $\sigma$ is a constant. The lengthy, but rather straightforward,
details are not given here. (A network for similar operations is described in some detail in [BS84].)

Although  in  the  general  case  the  performance  of  the  circuit  mentioned  above  is  optimal,  in  the 
specific  application  we  have  in  mind,  the  shuffle  and  unshuffle  are  performed  on  sequences  that  can  be 
decomposed into sorted subsequences, which can be compressed by encoding techniques. As a result we
can achieve a smaller height for the circuit.

We shall exploit the following decomposition of the multiway-unshuffle and of the multiway-shuffle
permutations, illustrated in Figure 7.2.

[Figure 7.2: diagram of the two-stage cascade, with groups of l elements at the first stage and the
full sequence of ml elements at the second stage.]

Figure 7.2.  Cascade decomposition of the p-UNSHUFFLE of ml elements. (Arrows represent
sequences of l/p elements.)


(1)  The $p$-unshuffle of $ml$ elements (where $p$ divides $l$) can be performed by (1a) applying a
$p$-unshuffle to each of the $m$ subsequences that we can form with $l$ consecutive elements, and then
(1b) applying a $p$-unshuffle to the sequence of the $mp$ sequences (regarded as single words) of $l/p$
consecutive elements in the arrangement resulting from (1a).

(2)  The $p$-shuffle of $ml$ elements (where $p$ divides $l$) can be performed by (2a) applying a $p$-shuffle to
$mp$ sequences (regarded as single words) of $l/p$ consecutive elements, and then (2b) applying a
$p$-shuffle to each of the $m$ subsequences of $l$ consecutive elements in the arrangement resulting from
(2a).
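Decomposition (1) can be checked on small instances. The sketch below is only illustrative: it assumes the usual convention that the $p$-unshuffle gathers, for $j = 0, \ldots, p-1$ in turn, the elements whose index is congruent to $j$ modulo $p$, and the function names are hypothetical.

```python
def unshuffle(seq, p):
    """p-unshuffle: concatenate the p subsequences of indices j, j+p, j+2p, ..."""
    assert len(seq) % p == 0
    return [x for j in range(p) for x in seq[j::p]]


def cascade_unshuffle(seq, m, l, p):
    """p-unshuffle of m*l elements through stages (1a) and (1b)."""
    assert len(seq) == m * l and l % p == 0
    # (1a) p-unshuffle each block of l consecutive elements.
    stage1 = []
    for block in range(m):
        stage1.extend(unshuffle(seq[block * l:(block + 1) * l], p))
    # (1b) p-unshuffle the m*p chunks of l/p elements, treated as single words.
    size = l // p
    chunks = [tuple(stage1[i:i + size]) for i in range(0, m * l, size)]
    return [x for chunk in unshuffle(chunks, p) for x in chunk]
```

When each block of $l$ elements is sorted, every chunk moved as a single word in stage (1b) is sorted as well, which is the property exploited in the rest of this section.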

We plan to use a shuffler/unshuffler block as part of a multiway-shuffle combiner. In this context,
the sequence to be shuffled or unshuffled consists of $m$ sorted subsequences of $l$ consecutive elements
each. In this case, it is easy to see that the sequences that are regarded as words in the second stage of
the decomposition are sorted. Thus, they can be encoded by the insert-and-prune method, and then be
recovered with appropriate decoding. This consideration suggests the scheme of Figure 7.3 for the entire
unshuffle operation. A similar scheme works for the shuffle. Obviously the same method would not
work for unsorted inputs, since after encoding we would be able to recover only the multiset
underlying the encoded sequence, but not the sequence itself.

If in the design of Figure 7.3 we make the unshuffling blocks bidirectional, and we replace
encoders and decoders with encoder/decoder blocks, we obtain a network that can also shuffle. We now
analyze the performance of such a shuffler/unshuffler block, for the case when $p = m$ and under the
assumption that we use building blocks with the following features (for later convenience, we use a
superscript $i$ to denote quantities related to building blocks, and a superscript $(i+1)$ to denote quantities
related to the overall design).

(a)  The encoder/decoder blocks, which operate on sequences of $n/m^2$ elements ($n = ml$) of $k$ bits each,
work in time $T_{ED}^{i}(n/m^2, k)$ and have height $H_{ED}^{i}(n/m^2, k)$.

(b)  The shuffler/unshuffler blocks, which operate on sequences of $n/m$ elements of $k$ bits each, work in
time $T_{SU}^{i}(n/m, k)$ and have height $H_{SU}^{i}(n/m, k)$.


[Figure 7.3: diagram of the cascade with sorted sequences at the inputs of the first-stage blocks.]

Figure 7.3.  Cascade decomposition of the p-UNSHUFFLE of ml elements when each of the m
subsequences input by blocks of the first stage is sorted. Encoders (E) and decoders (D)
operate on subsequences of l/p elements.

(c)  The shuffler/unshuffler block, which operates on $m^2$ (encoded) sequences, is realized according to
the straightforward method mentioned at the beginning of this section. Here $N = m^2$, since the
items being shuffled are $m^2$, and each item consists of a sequence of $n/m^2$ words, each represented
with $h_{i+1} = k - \log n + 2\log m$ bits (see the insert-and-prune encoding). Thus,

$H_{SU} \le \sigma\, 2 n h_{i+1} / T_{SU}$.    7.5


It is then easy to see that the performance of the entire shuffler/unshuffler circuit obtained by
cascading the different stages is given by

$H_{SU}^{i+1}(n,k) = 2\sigma n h_{i+1} / T_{SU} + H_{SU}^{i}(n/m, k) + 2 H_{ED}^{i}(n/m^2, k)$    7.6

$T_{SU}^{i+1}(n,k) = T_{SU} + T_{SU}^{i}(n/m, k) + 2 T_{ED}^{i}(n/m^2, k)$.    7.7

7.2.3  The Sorters

We consider now a network consisting of a set of $m$ $(n/m, k)$-sorters and of an
$m$-shuffler/unshuffler of $n$ keys. Such a network can easily perform all the steps required by the
multiway-shuffle combination algorithm described in Section 5.2.4. Obviously, the $m$ sorters can also
prepare the sorted sequences to be processed by the combiner, and, with small adaptations, they can
also perform the sorting operation in the 'windows' (refer to Section 5.2.4).

Thus, if we realize the sorters with a design with performance $T_S^{i}(n/m, k)$, $H_S^{i}(n/m, k)$, and the
shuffler/unshuffler with a design with performance $H_{SU}^{i+1}(n,k)$, $T_{SU}^{i+1}(n,k)$, we obtain a sorter with
global performance given by the following relations:

$H_S^{i+1}(n,k) \le H_{SU}^{i+1}(n,k) + H_S^{i}(n/m, k)$    7.8

$T_S^{i+1}(n,k) \le \gamma_1 T_{SU}^{i+1}(n,k) + \gamma_2 T_S^{i}(n/m, k)$    7.9

where $\gamma_1$ and $\gamma_2$ are constants. In fact the $(n/m, k)$-sorters and the shuffler/unshuffler are activated a
constant number of times during the entire algorithm.

With reference to Figure 7.1, we have now completed the description of the steps that allow us to
obtain the "new shuffler/unshuffler" and the "new sorter", given a sorter and a shuffler/unshuffler. We
shall repeatedly use these steps to construct a sequence of designs, as follows.


We begin with a design for the sorter having performance

$H_S^{1}(n,k)\, T_S^{1}(n,k) \le C_S^{1}\, nk$    7.10

where $C_S^{1}$ is a constant. Such performance can be achieved by any of the sorters described in Chapter
6, as long as

$\tau'_S \log n \le T_S^{1}(n,k) \le \tau''_S \sqrt{n \log n}$    7.11

for suitable constants $\tau'_S$ and $\tau''_S$. For the shuffler/unshuffler we begin with the straightforward
implementation that achieves

$H_{SU}^{1}(n,k)\, T_{SU}^{1}(n,k) \le \sigma\, nk$    7.12

as long as

$\tau'_{SU} \le T_{SU}^{1}(n,k) \le \ldots$    7.13

for suitable constants $\tau'_{SU}$ and $\tau''_{SU}$.

We can then define a sequence of designs where the (i+1)-th one is obtained from the i-th one
according to the scheme illustrated in Figure 7.1. A value $m_i$ must also be chosen for the parameter $m$
specifying how many sequences of $n/m$ keys are to be presorted by sorters of the i-th type. We shall
choose $m_i = L_i(n)$, where $L_i$ is the i-th iterate of the logarithm, formally defined by

$L_1(n) = \log n$    7.14

$L_i(n) = \log(L_{i-1}(n)),\; i > 1$    7.15
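To get a feeling for how quickly the parameter $m_i = L_i(n)$ shrinks from one design to the next, definitions 7.14 and 7.15 can be evaluated directly. This is a small illustrative sketch with hypothetical function names; it assumes base-2 logarithms and also evaluates the quantity $h_i = k - \log n + 2 L_i(n)$ that appears in the analysis below.

```python
import math


def iterated_log(n, i):
    """L_i(n): the logarithm (base 2, by assumption) applied i times to n."""
    assert i >= 1
    value = float(n)
    for _ in range(i):
        value = math.log2(value)
    return value


def effective_excess(n, k, i):
    """h_i = k - log n + 2 L_i(n), for keys of k bits."""
    return k - math.log2(n) + 2 * iterated_log(n, i)
```

For $n = 2^{16}$ the successive values are $L_1 = 16$, $L_2 = 4$ and $L_3 = 2$, so only a few iterations are meaningful for realistic sizes.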

We claim that the sequence of sorters and shuffler/unshufflers so defined satisfies the relations

$H_S^{i}\, T_S^{i} \le C_S\, n h_i$    7.16

$H_{SU}^{i}\, T_{SU}^{i} \le C_{SU}\, n h_i$    7.17

for $n$ large enough, where $C_S$ and $C_{SU}$ are constants, and

$h_i = k - \log n + 2 L_i(n)$.    7.18

Inequalities 7.16 and 7.17 can be proved by induction. For $i = 1$, they follow from 7.10 and 7.12 with
any $C_{SU}$ greater than or equal to $\sigma$.

In general, using 7.3, 7.4, 7.8, 7.9, 7.10 and 7.12, substituting $L_i(n)$ for $m$, and taking 7.16 and
7.17 as inductive hypotheses, we obtain


HkoTko  h, 


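$H_{SU}^{i+1} \le 2\sigma n h_{i+1} / T_{SU}$
$\qquad +\; C_{SU}\, n h_i / (L_i(n)\, T_{SU}^{i}(n/L_i(n), k))$
$\qquad +\; \eta\, \xi\, C_S\, n h_i / (L_i^2(n)\, T_S^{i}(n/L_i^2(n), k))$    7.20

$T_{SU}^{i+1} \le T_{SU} + T_{SU}^{i}(n/L_i(n), k) + 2 T_{ED}^{i}(n/L_i^2(n), k)$    7.21

$H_S^{i+1} \le H_{SU}^{i+1} + H_S^{i}(n/L_i(n), k)$    7.22

$T_S^{i+1} \le \gamma_1 T_{SU}^{i+1} + \gamma_2 T_S^{i}(n/L_i(n), k)$.    7.23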

We are further allowed to choose $T_S^{i}(n/L_i^2(n), k)$ and $T_{SU}^{i}(n/L_i(n), k)$ within the range of possible
sorting and shuffling/unshuffling computation times relative to the i-th design. If we choose them to be
proportional to $T_{SU}$, then inequalities 7.20 and 7.21 imply that

$H_{SU}^{i+1}\, T_{SU}^{i+1} \le 2\alpha\sigma\, n h_{i+1} + O(n h_i / L_i(n))$.    7.24

If $h_i = O(L_i(n))$, then $h_i / L_i(n) = O(1)$, and we obtain

$H_{SU}^{i+1}\, T_{SU}^{i+1} \le 2\alpha\sigma\, n h_{i+1}$ + lower order terms.    7.25

Under the same assumptions, inequalities 7.24 and 7.25 yield for the sorter

$H_S^{i+1}\, T_S^{i+1} \le 2\alpha\sigma\gamma_1\, n h_{i+1}$ + lower order terms.    7.26

Thus, if we define

$C_S = 2\alpha\sigma\gamma_1$    7.27

$C_{SU} \ge 2\alpha\sigma$    7.28

then 7.26 and 7.25 show that inequalities 7.16 and 7.17 hold for $i + 1$.

The  previous  discussion  can  be  summarized  as  in  the  following  theorem. 

Theorem 7.1. For any $i \ge 1$, an $(n, \log n + O(L_i(n)))$-sorter with I/O ports on the boundary can be
constructed, such that

$AT^2 = O(n^2 \log n\, L_i(n))$    7.29

for $T \in [\Omega(\log n),\, O(\sqrt{n L_i(n)})]$. Such a sorter is optimal if $k = \log n + \Theta(L_i(n))$.

The ideas exploited in this section could also be used to design non-boundary
$(n, \log n + O(L_i(n)))$-sorters with $AT^2 = O(n^2 L_i^2(n))$. However, the constructions are rather
elaborate and do not add further insight to the problem of sorting medium words, and therefore are not
reported here.


7.3  SORTERS FOR SHORT KEYS

In this section we derive upper bounds for the $(n,k)$-sorting problem when the keys are short, i.e.
when $k \le \log n$, or equivalently, when the size $r = 2^k$ of the universe is not larger than the size $n$ of
the multiset being sorted.

We recall from Chapter 4 the lower bounds for this problem. We restrict our attention to
word-local designs. In fact, as indicated by Theorem 4.8, non-word-local protocols lead to larger information
exchange than word-local ones.

It is useful to introduce the quantity

$d = n/r$    7.30

which, as we shall see below, plays an important role. Then, for boundary chips, we have from
Theorem 4.7 that


.47"-  —  Slid  logn  r2  log  r  ) 


7.31 


For non-boundary chips, Theorems 4.9 and 4.10 respectively yield

$AT^2 = \Omega(d\, r^2)$    7.32

and

$AT = \Omega(d\, r^{3/2})$.    7.33

Furthermore, Theorems 4.11 and 4.12 tell us that

$A = \Omega(r \log(1 + n/r))$    7.34

and

$T = \Omega(\log n)$.    7.35

7.3.1  The  Algorithm 

Here we propose a new sorting algorithm, specifically tailored to short keys, and we also describe a
VLSI implementation of it, whose performance comes very close to the above lower bounds.

The main idea of the algorithm consists in using an efficient encoding for multisets of small keys
in the intermediate stages of the sorting process. We shall in fact encode a multiset $S$ by means of its
distribution function. Let us recall (Equation 2.15) that if $S$ is a multiset on the universe
$U = \{0, 1, \ldots, r-1\}$, then the multiplicity of an element $i \in U$ is defined as

$\mu(i)$ = number of occurrences of element $i$ in multiset $S$,

and the distribution function (Equation 2.19) is defined as the vector

$M = (M(i) : i = 0, 1, \ldots, r-1)$, where $M(i) = \mu(0) + \mu(1) + \cdots + \mu(i)$.

A simple but useful property is that the distribution of the union of two multisets $S$ and $R$ is simply

$M_{S \cup R}(i) = M_S(i) + M_R(i)$    7.36

with obvious meaning of the symbols. Thus, we can say that the merging of two sequences is
transformed, in the distribution encoding, into the sum of their distribution functions. This property
is used to design the following simple algorithm:

1.  (ENCODE) Subdivide the input multiset $\{X_0, \ldots, X_{n-1}\}$ into $d = n/r$ submultisets of $r$ keys each,
and compute the distribution function of each submultiset.

2.  (TALLY) Sum the $d$ distribution functions (as $r$-component vectors) obtained in Step 1, to
produce the (global) distribution function of the entire input multiset.

3.  (BROADCAST) Replicate the global distribution function $d$ times.

4.  (DECODE) From the i-th replica of the distribution function obtain the $r$ consecutive output keys
$Y_{ir}, Y_{ir+1}, \ldots, Y_{(i+1)r-1}$ ($i = 0, 1, \ldots, d-1$), with a suitable decoding procedure.

The rationale for Step 4 is the wish to deploy decoders comparable to the corresponding encoders; this
creates the need for Step 3, the $d$-way replication of the distribution vector.
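The four steps can be followed in a plain sequential sketch. It is only an illustration of the data flow, not of the network: the function names are hypothetical, the distribution of each submultiset is obtained here by direct counting rather than by the sort-based transcoding of the next section, and decoding uses a binary search for the index $i$ with $M(i-1) \le h < M(i)$.

```python
from bisect import bisect_right


def distribution(keys, r):
    """Distribution function M(i) = mu(0) + ... + mu(i) of keys in 0..r-1."""
    counts = [0] * r
    for x in keys:
        counts[x] += 1
    dist, running = [], 0
    for c in counts:
        running += c
        dist.append(running)
    return dist


def sort_short_keys(xs, r):
    """Sort n keys from the universe 0..r-1, where r divides n."""
    n = len(xs)
    assert n % r == 0
    d = n // r
    # 1. ENCODE: one distribution per submultiset of r keys.
    partial = [distribution(xs[i * r:(i + 1) * r], r) for i in range(d)]
    # 2. TALLY: componentwise sum gives the global distribution.
    glob = [sum(column) for column in zip(*partial)]
    # 3. BROADCAST: d replicas, one per decoder.
    replicas = [list(glob) for _ in range(d)]
    # 4. DECODE: replica i produces Y_{ir}, ..., Y_{(i+1)r-1}.
    out = []
    for i, m in enumerate(replicas):
        out.extend(bisect_right(m, h) for h in range(i * r, (i + 1) * r))
    return out
```

Each decoder reads only its own replica and writes only its own block of $r$ outputs, which mirrors the reason given above for Step 3.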

A  preliminary  step  is  the  discussion  of  the  algorithms  for  encoding  and  decoding,  which  turn  out 
to  be  based  on  merging  and  sorting  operations. 


7.3.2  Transcoding Operations

In  order  for  the  algorithm  outlined  above  to  be  efficient,  we  need  an  efficient  way  to  obtain  the 
distribution  encoding  of  a  multiset  from  its  list  representation,  and  vice  versa.  We  propose  now  some 
algorithms  to  perform  these  transformations  of  encodings. 

List-to-Distribution (Encoding). Given a multiset $S$ represented by a list $(X_0, X_1, \ldots, X_{n-1})$ with
$X_i \in U = \{0, 1, \ldots, r-1\}$, we define a sorted list

$Z = (Z_0, Z_1, \ldots, Z_{n+r-1}) = \mathrm{sort}(S \cup U)$.    7.37

If $\mu(i)$ is the multiplicity of $i$ in $S$, then the structure of $Z$ is a concatenation of runs of identical
symbols

$Z = (\underbrace{0, \ldots, \underline{0}}_{\mu(0)+1},\; \underbrace{1, \ldots, \underline{1}}_{\mu(1)+1},\; \ldots,\; \underbrace{r-1, \ldots, \underline{r-1}}_{\mu(r-1)+1})$.    7.38

If we consider the last element in a run of the form $i, i, \ldots, i$ (underscored in 7.38), we can see that its
index in the sequence $Z$ is $h = M(i) + i$, where $M(i) = \mu(0) + \mu(1) + \cdots + \mu(i)$. The last element in
a run can be easily recognized because it differs from its successor. Thus, we can construct a sequence
$W'$ defined as

$W'_h = h - Z_h$  if $Z_{h+1} \ne Z_h$ (that is, $W'_h = M(Z_h)$),
$W'_h = n$  if $Z_{h+1} = Z_h$.    7.39

If we sort $W'$ and define $W = \mathrm{sort}(W')$, all the elements of $W'$ that are equal to $n$ will occupy the
last $n$ positions of $W$, and we can extract the distribution $M$ of multiset $S$ from the first $r$ positions,
i.e.

$M = (M(0), \ldots, M(r-1)) = (W_0, \ldots, W_{r-1})$.    7.40

If necessary, the multiplicity could be obtained as $\mu(i) = M(i) - M(i-1)$, where $M(-1) = 0$.

Example.

S = {0,1,1,2,4,4,4,6,7,7}, (n = 10), U = {0,1,2,3,4,5,6,7}, (r = 8).

Z = (0,0,1,1,1,2,2,3,4,4,4,4,5,6,6,7,7,7)

W' = (10,1,10,10,3,10,4,4,10,10,10,7,7,10,8,10,10,10)

W = (1,3,4,4,7,7,8,10,10,...,10)

M = (1,3,4,4,7,7,8,10)

$\mu$ = (1,2,1,0,3,0,1,2).
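The encoding procedure translates almost line by line into code. The sketch below is illustrative only: the function name is hypothetical, Python's built-in sort replaces the sorting network, and the final element of $Z$, which has no successor, is treated as the last element of its run.

```python
def list_to_distribution(xs, r):
    """Distribution function of the multiset xs over the universe 0..r-1."""
    n = len(xs)
    z = sorted(list(xs) + list(range(r)))       # Z = sort(S u U), n + r entries
    w_prime = []
    for h, zh in enumerate(z):
        last_in_run = (h == n + r - 1) or (z[h + 1] != zh)
        # The last element of run i sits at index M(i) + i, so h - Z_h = M(Z_h).
        w_prime.append(h - zh if last_in_run else n)
    w = sorted(w_prime)                          # the n filler values move to the end
    return w[:r]                                 # (M(0), ..., M(r-1))
```

Applied to the multiset of the example, it returns the distribution $M = (1,3,4,4,7,7,8,10)$; the multiplicities follow by taking successive differences.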

Distribution-to-List (Decoding). Given the distribution vector $M = (M(0), M(1), \ldots, M(r-1))$ of a
multiset $S$, whose sorted list representation is $(Y_0, Y_1, \ldots, Y_{n-1})$, we want to compute a set of $p$
consecutive elements of this list starting at $Y_b$, i.e. we want to compute $(Y_b, Y_{b+1}, \ldots, Y_{b+p-1})$. Obviously, if
$b = 0$ and $p = n$, we obtain the entire sorted sequence of $S$. However, as we shall see, it is useful to be able
to compute different portions of sequence $(Y_0, \ldots, Y_{n-1})$ independently of each other.

The method proposed is based on the following idea. If $Y_h = i$, then there are at least $h + 1$
elements of $S$ not larger than $i$, and at most $h$ elements smaller than $i$, so that $M(i-1) \le h < M(i)$.
Thus, if we insert $h$ ($0 \le h \le n-1$) into the sorted sequence $M = (M(0), \ldots, M(r-1))$ and find the
value $i$ such that $M(i-1) \le h < M(i)$, we can conclude that $Y_h = i$.

Then, if we want to compute $Y_b, Y_{b+1}, \ldots, Y_{b+p-1}$, we have to simultaneously insert the elements
of the sequence $B = (b, b+1, \ldots, b+p-1)$ into sequence $M$, which can be done by merging $M$ and $B$.
For later use, we first append to each of the keys to be merged a tag field with value zero for elements
in $M$, and with value one for elements in $B$. Then we merge $M$ and $B$ in a stable way obtaining a
sequence

$W = \mathrm{merge}(M, B)$.    7.41

In sequence $W$, an element $h$ of $B$ follows $M(0), M(1), \ldots, M(Y_h - 1)$, as well as $b, b+1, \ldots, h-1$, and
will therefore occupy the $(Y_h + h - b)$-th position. If the j-th element of $W$ comes from sequence $B$
(which we can test from the tag field) and has value $h$, then we update it as

$W_j \leftarrow j - (h - b) = Y_h$.    7.42

At this point, by a simple unmerge of the sequences of zero tags and one tags, we obtain a sequence of
one-tag elements equal to $(Y_b, Y_{b+1}, \ldots, Y_{b+p-1})$.

Example. M = (1,3,4,4,7,7,8,10). (Multiset S is the same as in the previous example.) B = (6,7,8),
i.e. b = 6, and p = 3. If we denote "tag-one" by underscoring we have

$W = (1, 3, 4, 4, \underline{6}, 7, 7, \underline{7}, 8, \underline{8}, 10)$.

The three underscored elements $W_4$, $W_7$, and $W_9$ are updated according to 7.42 yielding
$W_4 = 4 - (6-6) = 4$, $W_7 = 7 - (7-6) = 6$, $W_9 = 9 - (8-6) = 7$, so that $Y_6 = 4$, $Y_7 = 6$ and $Y_8 = 7$.
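The decoding procedure can be sketched in the same spirit. The function name is hypothetical, and sorting the pairs (value, tag) is used here as a stand-in for the stable merge: on equal values the tag-zero elements of $M$ precede the tag-one elements of $B$, which is the order the argument above requires. Positions in $W$ are counted from zero.

```python
def distribution_to_list(m, b, p):
    """Return (Y_b, ..., Y_{b+p-1}) from the distribution vector m."""
    # Tag 0 marks elements of M, tag 1 marks elements of B = (b, ..., b+p-1).
    tagged = [(value, 0) for value in m] + [(h, 1) for h in range(b, b + p)]
    w = sorted(tagged)                    # equal values: tag 0 before tag 1
    out = []
    for j, (h, tag) in enumerate(w):
        if tag == 1:
            out.append(j - (h - b))       # update 7.42: W_j <- j - (h - b) = Y_h
    return out
```

With $M = (1,3,4,4,7,7,8,10)$, $b = 6$ and $p = 3$ this reproduces the example; with $b = 0$ and $p = n$ it returns the whole sorted list.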

In  general,  the  elements  of  both  .V  =  (M(0),...,M (r-l))  and  B  =(bj>+ 1 . 6+p-l)  are 

numbers  in  the  range  0,1, . . .  ji-l,  and  their  binary  representation  requires  logrt  bits.  We  discuss  now 
some  modifications  of  the  above  procedure  that  allow  us  to  work  with  numbers  with  k+l  bits,  at  least 
in  the  case  when  b  -  ir ,  and  p  =  r ,  which  is  needed  in  our  sorting  algorithm. 

If $B = (ir, ir+1, \ldots, ir+r-1)$, we can replace $M(h)$ by $ir$ whenever $M(h) < ir$, and by $(i+1)r$
whenever $M(h) > (i+1)r$, without affecting the order of elements of sequences $B$ and $M$. This
observation suggests the definition of a new sequence, which we call the $i$-modified distribution
function, i.e.

$M_i(h) = \begin{cases} ir & \text{if } M(h) < ir \\ M(h) & \text{if } ir \le M(h) < (i+1)r \\ (i+1)r & \text{if } M(h) \ge (i+1)r \end{cases}$    7.43

Then, the sequence $(Y_{ir}, \ldots, Y_{ir+r-1})$ can be obtained by a straightforward modification of
the above decoding procedure, operating on the sequences $M_i = (M_i(0), \ldots, M_i(r-1))$, and
$B = (0, 1, \ldots, r-1)$, rather than on sequences $M$ and $B$. The advantage lies in the fact that
elements of $M_i$ and $B$ can be represented with $k+1$ bits (instead of $\log n$).
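A minimal sketch of this modification follows. It assumes that the clamped values are shifted down by $ir$ so that they fall in the range $0, \ldots, r$; this is one reading of how $M_i$ can be paired with $B = (0, \ldots, r-1)$. The function names are hypothetical.

```python
def distribution(S, r):
    """M(h) = number of elements of S with value <= h, for h = 0..r-1."""
    counts = [0] * r
    for x in S:
        counts[x] += 1
    M, total = [], 0
    for c in counts:
        total += c
        M.append(total)
    return M


def modified_distribution(M, i, r):
    """Clamp M into [i*r, (i+1)*r] as in 7.43, then shift into 0..r."""
    return [min(max(m, i * r), (i + 1) * r) - i * r for m in M]


def decode_block(M, i, r):
    """Return (Y_{ir}, ..., Y_{ir+r-1}) working only with values in 0..r."""
    Mi = modified_distribution(M, i, r)
    tagged = [(v, 0) for v in Mi] + [(t, 1) for t in range(r)]
    W = sorted(tagged)  # stable merge, M-elements first among equals
    return [j - v for j, (v, tag) in enumerate(W) if tag == 1]


S = [3, 0, 2, 2, 1, 3, 3, 0]   # n = 8 keys of k = 2 bits, r = 4, d = 2
M = distribution(S, 4)          # cumulative counts: [2, 3, 5, 8]
blocks = [decode_block(M, i, 4) for i in range(2)]
```

Concatenating the decoded blocks reproduces the sorted multiset, while every value handled by `decode_block` is at most $r$.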

In the sorting algorithm outlined in Section 7.3.1, both the encoding and the decoding procedures
are applied to multisets of $r$ elements of $k = \log r$ bits each. It is then easy to see that both the
encoder and the decoder can be realized as simple modifications of a $(2r, \log r + 1)$-sorter. These
modifications can be done without affecting the (order of the) area-time performance of the sorter itself.

7.3.3  The  Network

We discuss first a nonpipelined version of the network, and then we obtain the area-time trade-off
by means of a pipelined version.

We recall that $n$, $r = 2^k$, and $d = n/r$ are powers of two, and we introduce the following
subsequences of the input and of the output sequences of the sorter:

$S_j = (X_{jr}, X_{jr+1}, \ldots, X_{jr+r-1})$,

$R_j = (Y_{jr}, Y_{jr+1}, \ldots, Y_{jr+r-1})$.

We also consider the distribution function of multiset $S_j$,

$M_j = (M_j(0), \ldots, M_j(r-1))$,

which is a vector with $r$ $(k+1)$-bit components.

The nonpipelined version of the sorting network is the cascade of four parts, illustrated in Figure
7.4, each performing one of the four steps of the algorithm.


Figure 7.4. Structure of the network for sorting a set of short keys.

(a) (ENCODERS) Encoders $E_0, \ldots, E_{d-1}$, each capable of computing the distribution of a given
$(r,k)$-multiset. Encoder $E_j$ inputs $S_j$ and computes $M_j$. We assume that each encoder has $r$
input lines and $r$ output lines, and that I/O operations on words are bit-serial.

(b) (TALLY TREES) Tally trees $TL_0, \ldots, TL_{r-1}$, each a full binary tree on $d$ leaves, where a
node at distance $l$ from the leaves is equipped with an $O(l)$-bit storage and an $l$-bit operand
carry-save adder, and is connected to its father by $O(1)$ wires. The $j$-th leaf of tally tree $TL_h$
is connected to the $h$-th output line of encoder $E_j$, from which it will read - in bit-serial
fashion, LSB first - the distribution value $M_j(h)$. By summing $M_0(h), \ldots, M_{d-1}(h)$, $TL_h$
computes $M(h)$. Thus each tree tallies $d$ $k$-bit numbers to produce a $(k + \log d) = \log n$ bit
result. The operation of a tally tree is illustrated in Figure 7.5. First, for each bit position we
obtain a $\log d$-bit count of its 1's (this is done by suitable adders at the nodes of the tree);
next the bit-counts are added with the correct alignment (carry-release) at the root of the tree.
Each of these additions is performed in $O(1)$ time on a redundant carry-save representation. The
conversion from carry-save to standard is done at the end of the step in time $O(k)$. Note that at
any time only one level of the tree is occupied by data generated by a given bit position.

Figure 7.5. Operation of a tally tree (binary counting, tree function).

(c) (BROADCAST TREES) Broadcast trees $BC_0, \ldots, BC_{r-1}$, are similar in structure to the tally
trees, but different in the functional capabilities of their nodes. The $j$-th leaf of broadcast tree
$BC_h$ is connected to the $h$-th input line of decoder $D_j$, to which the value $M_j(h) \bmod r$ of
the $j$-modified distribution must be transmitted. Let $j_0$ be such that $j_0 r \le M(h) < (j_0+1)r$.
Then leaves $0, 1, \ldots, j_0-1$ of $BC_h$ must receive the value $r$, leaf $j_0$ receives
$M(h) \bmod r$, and leaves $j_0+1, \ldots, d-1$ must receive the value 0. This is done as follows. The
$\log d = \log n - k$ most significant bits of $M(h)$, which are indeed the binary expansion of $j_0$,
are used to set leaf $j_0$ to receive the $k$ least significant bits of $M(h)$ and to appropriately
force all other leaves. This would be trivial if $\log n$ time were allowed for this operation.
However, since we only allow time $k$, the $\log d$ MS-bits are injected in parallel into the root,
and trace the path to leaf $j_0$ losing one (most significant) bit at each level; the least
significant $k$ bits follow serially.

(d) (DECODERS) Decoders $D_0, \ldots, D_{d-1}$, each capable of computing the portion $R_j$ of the
output sequence from the appropriate modified distribution of the entire input multiset. The I/O
operations are performed with a protocol similar to the one used by the encoders.
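The arithmetic performed by the two tree structures can be illustrated as follows. This is a functional sketch only: it ignores the carry-save representation, the timing, and the path-tracing mechanism, and the names `tally` and `broadcast` and the `bits` parameter are assumptions made for the illustration.

```python
def tally(values, bits):
    """Sum numbers by first counting the 1's of each bit position."""
    total = 0
    for pos in range(bits):  # LSB first, as in the bit-serial protocol
        # Count of 1's in this bit position (formed by the adders in the tree).
        ones = sum((v >> pos) & 1 for v in values)
        # Aligned addition of the bit-count (done at the root).
        total += ones << pos
    return total


def broadcast(Mh, r, d):
    """Values delivered to leaves 0..d-1 of a broadcast tree, given M(h) = Mh."""
    j0, low = divmod(Mh, r)  # j0: high-order part, low: M(h) mod r
    return [r if j < j0 else low if j == j0 else 0 for j in range(d)]
```

For instance, `broadcast(5, 4, 3)` delivers `[4, 1, 0]`: leaf 0 is forced to $r = 4$, leaf 1 receives $5 \bmod 4 = 1$, and leaf 2 is forced to 0.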

An important remark is that the above network has period $k$. Therefore it can be used in a pipeline
fashion with this period. This leads to the final sorting network. Letting $d = d_1 d_2$ and
$r = r_1 r_2$ (since $d$ and $r$ are powers of 2, so are their factors) the network has $d_2$ encoders
and $d_2$ decoders, each with $r_2$ input and output lines. Correspondingly, there are $r_2$ tally and
broadcast trees, each with $d_2$ leaves. In this network, a given encoder will process $d_1$ different
multisets, and a given decoder will compute $d_1$ different subsequences of the output. Each
"wavefront" has a depth of $k$ bits, so that the period of the network matches the depth of each
pipelined wavefront.

7.3.4  Area-Time  Performance 

We  shall  focus  on  encoders  and  tally  trees,  since  decoders  and  broadcast  trees  are  analogous. 

An encoder with $r_2$ I/O lines can be realized as a modification of an
$(r, \log r + \Theta(\log r))$-sorter (see Chapter 6), with performance $A = O(r_2^2)$,
$T = O(kr/r_2)$, for $r_2$ in the range $\sqrt{kr} \le r_2 \le r$.

The tally tree structure, with $d_2$ leaves and edge-bandwidth $r_2$, can be laid out in
$O(d_2 r_2^2)$ area, by using the H-tree scheme. This area also accounts for the encoder modules.

Finally, adding the contribution of the $r$ $\log n$-bit registers deployed to store the $r$ values
of the distribution, we obtain a global area

$A = O(d_2 r_2^2 + r \log n)$,    7.44

where $1 \le d_2 \le d$.

The running time is of the form (for suitable constants $C_1$ and $C_2$)

$T = C_1 (C_2 k r_1 d_1 + \log d_2)$.    7.45

In fact, an encoder spends $O(k r_1)$ time to process the $r_1$ data wavefronts for each of the $d_1$
subproblems assigned to it. A similar performance is achieved by the tally trees when used in
pipeline, with the addition of the term $\log d_2$ representing the depth of the pipe. Recalling that
$r_1 = r/r_2$, $d_1 = d/d_2$ and $d = n/r$, 7.45 can be rewritten as

$T = C_1 (C_2 kn/(r_2 d_2) + \log d_2)$.    7.46

At this point, the analysis of the network performance is complete. However, we can still optimize
the choice of $r_2$ and $d_2$. Formally, for each feasible value $T$ of the computation time, we should
minimize $A$ (as given by 7.44) with respect to $d_2$ and $r_2$, which are subject to the appropriate
constraints.

On an intuitive basis we expect the following facts. The minimum computation time should be
achieved by the network with the maximum degree of parallelism, i.e. with maximum $r_2$ and $d_2$. To
obtain slower networks we have two possibilities: one is to slow down the encoders and the decoders
(by decreasing $r_2$), and the other is to decrease their number ($d_2$). As long as it is possible, we
prefer to decrease $r_2$, because the area depends quadratically on $r_2$ and linearly on $d_2$ (see
7.44). However, when $r_2$ reaches its lower limit $\sqrt{kr}$, the only option left is decreasing $d_2$.

Thus, we shall obtain that for fast computations the area depends quadratically on $1/T$, and for
slow computations the area depends linearly on $1/T$. This result is not surprising since we had
already found a similar behavior for the lower bounds.

On a more quantitative basis we introduce the variable

$r_2^* = \min\{\, r,\ C_2 kn/(d \log d) \,\}$    7.47

and distinguish two cases:

1. $r_2^* \ge \sqrt{kr}$. In this case, if we let $r_2$ vary in the interval $[\sqrt{kr}, r_2^*]$ while
keeping fixed $d_2 = d$, we obtain

$AT^2 = O(d\,(kr)^2)$, for $T \in [\Omega(\log n), O(\sqrt{kr})]$.    7.48

If we hold $r_2$ fixed and equal to $\sqrt{kr}$, and we let $d_2$ vary in the interval
$[\log n/k, d]$, we obtain

$AT = O(d\,(kr)^{3/2})$, for $T \in [\Omega(\sqrt{kr}), O(k^{3/2} n/(\sqrt{r} \log n))]$.    7.49

For $d_2 < \log n/k$, the term $r \log n$ prevails in the right-hand side of 7.44, so that no
reduction in area would result by selecting a computation time larger than
$\Theta(k^{3/2} n/(\sqrt{r} \log n))$.

2. $r_2^* \le \sqrt{kr}$. From 7.46 we can see that this condition is equivalent to
$(1 + C_2\sqrt{r/k})\,k \le \log n$. In this case $r$ is so small that even with the slowest encoder and
decoder, the encoding/decoding time would be less than the tally/broadcast time, if we were to use
$d$ leaves in the tree structures. Thus, we define a value $d_2^*$ by the equation

$d_2^* \log d_2^* = C_2 \sqrt{kr}\, n/r$,    7.50

and we consider the class of networks obtained when $d_2 \in [\log n/k, d_2^*]$ while
$r_2 = \sqrt{rk}$. The performance is

$AT = O(d\,(kr)^{3/2})$, for $T \in [\Omega(\log n), O(k^{3/2} n/(\sqrt{r} \log n))]$.    7.51
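The two products can be checked by direct substitution into 7.44 and 7.46. The sketch below drops constants and assumes, as in the discussion above, that the encoding term dominates the running time and that the term $d_2 r_2^2$ dominates the area.

```latex
% Fast regime: d_2 = d and \sqrt{kr} \le r_2 \le r_2^*
A = O\!\left(d\, r_2^2\right), \qquad
T = O\!\left(\frac{kn}{r_2 d}\right) = O\!\left(\frac{kr}{r_2}\right)
\;\Longrightarrow\;
AT^2 = O\!\left(d\,(kr)^2\right).

% Slow regime: r_2 = \sqrt{kr} and \log n / k \le d_2 \le d
A = O\!\left(d_2\, kr\right), \qquad
T = O\!\left(\frac{kn}{\sqrt{kr}\; d_2}\right)
\;\Longrightarrow\;
AT = O\!\left(kn\sqrt{kr}\right) = O\!\left(d\,(kr)^{3/2}\right),
\quad\text{since } n = dr.
```

In both regimes the product is independent of the free parameter ($r_2$ in the first case, $d_2$ in the second), which is why each regime is described by a single area-time relation.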

The above discussion is summarized by the following theorem.

Theorem 7.2. An $(n,k)$-sorter can be constructed, for $1 \le k \le \log n$, with the following
performance ($r = 2^k$, $d = n/r$, $C_2$ a suitable constant):

$AT^2 = O(d\,(kr)^2)$ for $T \in [\Omega(\log n), O(\sqrt{kr})]$    7.52

and

$AT = O(d\,(kr)^{3/2})$ for $T \in [\Omega(\sqrt{kr}), O(k^{3/2} n/(\sqrt{r} \log n))]$.    7.53

If $(1 + C_2\sqrt{r/k})\,k \le \log n$, then

$AT = O(d\,(kr)^{3/2})$ for $T \in [\Omega(\log n), O(k^{3/2} n/(\sqrt{r} \log n))]$.    7.54

Comparing the results of Theorem 7.2 with lower bounds 7.32 and 1.22, we can make the following
observations. In the range of computation times where the governing bounds are of the $AT^2$ form,
there is an $O(k^2)$ gap. In the range where the governing bounds are of the $AT$ form, the gap is
instead $O(k^{3/2})$.


In fact, when manipulating multisets of $r$ keys, in the form either of lists or of distributions, our
circuits function on an $O(kr)$-bit representation of the multisets, where $O(r)$ bits are sufficient
from an information-theoretic viewpoint.

One potentially useful modification is the use of sorters for medium-length keys, since encoders
and decoders are based on $(2r, \log(2r))$-sorters. However, this would create a new problem, that is:
once we keep the multisets of size $r$ encoded with $O(r)$ bits, it is not immediate to see how the
multiplicity of different multisets can be tallied.

Remark.  The  network  described  in  Section  7.3.3  can  also  be  laid  out  with  all  the  I/O  ports  on  the  boun¬ 
dary.  A  simple  analysis  would  show 

.4  »  0{dz\o9rd^"i  Arlogn),  725 

and a result analogous to Theorem 7.2 can be obtained.


7.4  SORTERS  FOR  LONG  KEYS 

In this section we derive upper bounds for the $(n,k)$-sorting problem when the keys are long, i.e.
when $k \ge 2 \log n$.

We summarize first what we already know about the problem. The case of word-local protocols
is easily taken care of. In fact, it is not difficult to realize that all constructions proposed in
Chapter 6 achieve $AT^2 = O(k^2 n^2)$ on keys of arbitrary length $k$, thus attaining the
$AT^2 = \Omega(k^2 n^2)$ lower bound of Theorem 4.13.


Thus, we turn our attention to non-word-local protocols. It is useful to define the quantity

$d = k/\log n$,

which, as we shall soon see, plays an interesting role in the sorting of long words. Considering that
$k = d \log n$, the lower bounds obtained in Theorems 4.18 and 4.19 can be respectively restated as

$AT^2 = \Omega(d\,(n \log n)^2)$    7.57

and

$AT = \Omega(d\,(n \log n)^{3/2})$.    7.58

Furthermore, Theorems 4.20 and 4.21 tell us that

$A = \Omega(n \log n)$    7.59

regardless of $d$ (or $k$), and

$T = \Omega(\log n + \log k) = \Omega(\log n + \log d)$.    7.60

The performance of known constructions discussed in Chapter 6 is

$AT^2 = O(k^2 n^2) = O(d^2 (n \log n)^2)$    7.61

as we have mentioned above. Comparing bounds 7.57 and 7.61 we see that there is an $O(d)$ gap so that
the known designs are optimal only if $d = O(1)$. The general case, when $d$ increases with $n$, needs
further investigation.

We shall present a new design of an $(n,k)$-sorter whose performance comes very close to the lower
bounds 7.57 and 7.58.

7.4.1  A  Non  Word-Local  Sorting  Algorithm 

From the preceding discussion, it is obvious that to improve the $AT^2 = O(d^2 (n \log n)^2)$ upper
bound we have to resort to non-word-local algorithms. Moreover, the form of the lower bounds, which
are linear in $d$, suggests the decomposition of the problem in $d$ subproblems, whose solutions are
combined with small information exchange.

The approach that we shall follow consists of decomposing the keys in blocks of consecutive bits,
and then processing together the homologous blocks of different keys. A similar approach has been
considered by Leighton. Some notation will be useful for our discussion. (Refer to Figure 7.6.)

For simplicity we assume that $n = 2^\nu$ (so that $\log n = \nu$ is an integer) and that
$k = d \log n = d\nu$, for integer $d$. We observe that to require that $k/\log n$ is an integer is not a
serious constraint, since we can always comply with it by adding less than $\log n$ bit positions to
the keys without changing the input size significantly.

With the above assumptions, we can partition each key into $d$ blocks of consecutive bits. We
denote the $h$-th (least significant) block of key $X_i$ by

$X_i(h) = X_i^{(h+1)\nu-1} \cdots X_i^{h\nu}$,    7.62

for $h = 0, 1, \ldots, d-1$. (See Figure 7.6.) A similar partition can be also considered for the
output keys,

$Y_i(h) = Y_i^{(h+1)\nu-1} \cdots Y_i^{h\nu}$.    7.63

It is obvious that, for given $h$, $(Y_0(h), \ldots, Y_{n-1}(h))$ is a permutation of
$(X_0(h), \ldots, X_{n-1}(h))$. This permutation is functionally dependent on the values of the bits
in blocks $h, h+1, \ldots, d-1$, as we have already seen in Section 4.2, where the information
transfer caused by this dependence has been informally called "secondary flow".

The rank of key $X_i$ in the multiset $\{X_0, \ldots, X_{n-1}\}$ is the number of keys in the multiset
that are strictly smaller than $X_i$. Formally

$\mathrm{rank}(X_i) = |\{j : X_j < X_i\}|$.    7.64

The following property of the rank is very useful for us. Suppose that each key $X_0, \ldots, X_{n-1}$
is viewed as the concatenation of three strings, namely

$X_i = L_i * C_i * R_i$,

where $*$ denotes string concatenation. Moreover the number of bits of $L_i$ is not a function of $i$,
and similarly for $C_i$ and $R_i$. Then if we define

$\mathrm{rank}(C_i) = |\{j : C_j < C_i\}|$    7.65

and we view $\mathrm{rank}(C_i)$ as a binary string of $\nu$ bits, we have that

$\mathrm{rank}(X_i) = \mathrm{rank}(L_i * \mathrm{rank}(C_i) * R_i)$.    7.66

Equation 7.66 follows from the fact that $X_j < X_i$ if and only if
$L_j * \mathrm{rank}(C_j) * R_j < L_i * \mathrm{rank}(C_i) * R_i$, whose proof is almost immediate.

If we consider the decomposition

$X_i = X_i(d-1) * \cdots * X_i(h) * \cdots * X_i(0)$

of the input keys, and we repeatedly apply Equation 7.66 we obtain

$\mathrm{rank}(X_i) = \mathrm{rank}(\mathrm{rank}(X_i(d-1)) * \cdots * \mathrm{rank}(X_i(h)) * \cdots * \mathrm{rank}(X_i(0)))$,    7.67

which, in words, says that the rank of the concatenation is the rank of the concatenation of the ranks.
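Property 7.66 is easy to check numerically. In the sketch below, each key is modeled as a tuple $(L, C, R)$ of small integers, so that tuple comparison plays the role of comparing fixed-width concatenated strings; the helper names and the test parameters are assumptions made for the illustration.

```python
from bisect import bisect_left
import random


def ranks(keys):
    """rank(x) = number of keys strictly smaller than x, as in 7.64."""
    ordered = sorted(keys)
    return [bisect_left(ordered, x) for x in keys]


def check_rank_property(n=16, values=4, seed=0):
    """Check 7.66 on random keys: replacing C by rank(C) preserves all ranks."""
    rng = random.Random(seed)
    keys = [tuple(rng.randrange(values) for _ in range(3)) for _ in range(n)]
    c_ranks = ranks([c for _, c, _ in keys])
    replaced = [(l, rc, r) for (l, _, r), rc in zip(keys, c_ranks)]
    return ranks(keys) == ranks(replaced)
```

The check succeeds because the map from $C_i$ to $\mathrm{rank}(C_i)$ preserves both order and equality, so every pairwise comparison between keys is unchanged.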


This  property  allows  us  to  reduce  the  computation  of  the  rank  of  long  keys  to  the  computation  of  the 
rank  of  small  substrings  of  the  keys  themselves,  but  we  do  not  know  yet  how  to  compute  the  rank  of 
the  substrings.  This  problem  can  be  solved  by  the  following  procedure,  which  is  based  on  sorting. 

(i) (EXTEND) To compute the ranks of the elements of $\{X_0, X_1, \ldots, X_{n-1}\}$ form a new set of
keys $\tilde{X}_i = X_i * i$, $i = 0, \ldots, n-1$, where $i$ is represented with $\nu$ bits.

(ii) (SORT) Sort $\{\tilde{X}_0, \ldots, \tilde{X}_{n-1}\}$ to obtain a sorted sequence
$\tilde{Y}_0, \ldots, \tilde{Y}_{n-1}$. Then

$\tilde{Y}_i = X_{\pi(i)} * \pi(i)$, $i = 0, \ldots, n-1$,

where $\pi(0), \ldots, \pi(n-1)$ is a permutation of $0, \ldots, n-1$.

(iii) (RANK) Compute $\mathrm{rank}(X_{\pi(i)})$ as one plus the maximum index $j$ such that
$X_{\pi(j)} < X_{\pi(i)}$. If no such index exists, then let $\mathrm{rank}(X_{\pi(i)}) = 0$.

(iv) (EXTRACT) Form a new set of keys $Z_i = \pi(i) * \mathrm{rank}(X_{\pi(i)}) * X_{\pi(i)}$ and sort
to obtain the sequence $(\hat{X}_0, \ldots, \hat{X}_{n-1})$ where
$\hat{X}_i = i * \mathrm{rank}(X_i) * X_i$.

Example. An example will illustrate the ranking algorithm. For simplicity we use decimal digits
instead of bits.

$(X_0, \ldots, X_6) = (7,6,1,4,4,7,9)$

$(X_0 * 0, \ldots, X_6 * 6) = (70,61,12,43,44,75,96)$

$(X_{\pi(0)} * \pi(0), \ldots, X_{\pi(6)} * \pi(6)) = (12,43,44,61,70,75,96)$

$(X_{\pi(0)}, \ldots, X_{\pi(6)}) = (1,4,4,6,7,7,9)$

$(\mathrm{rank}(X_{\pi(0)}), \ldots, \mathrm{rank}(X_{\pi(6)})) = (0,1,1,3,4,4,6)$.
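The four steps of the ranking procedure can be sketched as follows. The concatenations are modeled with Python tuples, and the function name is an assumption; the sketch is sequential and says nothing about how the steps are realized in the network.

```python
def ranks_by_sorting(X):
    """Compute rank(X_i) for every i using two sorts, following steps (i)-(iv)."""
    # (i) EXTEND: append the index i to each key.
    extended = [(x, i) for i, x in enumerate(X)]
    # (ii) SORT: Y[pos] = (X_pi(pos), pi(pos)).
    Y = sorted(extended)
    # (iii) RANK: in sorted order, a key equal to its predecessor shares its
    # rank; otherwise its rank is its own position (one plus the largest
    # index holding a strictly smaller key).
    rank_sorted = []
    for pos, (x, _) in enumerate(Y):
        if pos > 0 and Y[pos - 1][0] == x:
            rank_sorted.append(rank_sorted[-1])
        else:
            rank_sorted.append(pos)
    # (iv) EXTRACT: sort on the original index to restore the input order.
    Z = sorted((i, rk, x) for (x, i), rk in zip(Y, rank_sorted))
    return [rk for _, rk, _ in Z]


print(ranks_by_sorting([7, 6, 1, 4, 4, 7, 9]))  # [4, 3, 0, 1, 1, 4, 6]
```

Listed in sorted order of the keys, these ranks are (0, 1, 1, 3, 4, 4, 6), matching the example.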

Once the ranks have been computed they can be used to sort each of the blocks (into which the keys
have been partitioned) independently of one another. Indeed, if

$V_i = \mathrm{rank}(X_i) * X_i(h)$,

and $(W_0, \ldots, W_{n-1})$ is the sorted sequence corresponding to $(V_0, \ldots, V_{n-1})$, then it
is easy to see that

$W_i = \mathrm{rank}(Y_i) * Y_i(h)$.


Summarizing the preceding discussion, we obtain the following divide-and-conquer sorting
algorithm.

1. (DIVIDE) Decompose the input keys $X_0, \ldots, X_{n-1}$ into $d$ blocks of $\nu = \log n$
consecutive bits each, so that $X_i = X_i(d-1) * \cdots * X_i(h) * \cdots * X_i(0)$.

2. (SUBPROBLEMS) For each $h = 0, \ldots, d-1$, compute
$\mathrm{rank}(X_0(h)), \ldots, \mathrm{rank}(X_{n-1}(h))$ with respect to multiset
$\{X_0(h), \ldots, X_{n-1}(h)\}$.

3. (MARRY) Compute the ranks of the $X_i$'s using Equation 7.66. More specifically, with the
simplifying assumption that $d$ is a power of two, we compute the right-hand side of 7.67 with a
fully balanced tree of operations. Each operation has two input sequences and produces as output the
sequence of the ranks of their concatenation.

4. (ROUTE) Replicate the sequence $(\mathrm{rank}(X_0), \ldots, \mathrm{rank}(X_{n-1}))$ $d$ times -
one for each block - and sort the sequence
$(\mathrm{rank}(X_0) * X_0(h), \ldots, \mathrm{rank}(X_{n-1}) * X_{n-1}(h))$ for
$h = 0, 1, \ldots, d-1$, to obtain the sequence
$(\mathrm{rank}(Y_0) * Y_0(h), \ldots, \mathrm{rank}(Y_{n-1}) * Y_{n-1}(h))$.

5. (OUTPUT) Obtain the output keys $Y_0, \ldots, Y_{n-1}$ as
$Y_i = Y_i(d-1) * \cdots * Y_i(h) * \cdots * Y_i(0)$.
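The five steps can be traced in a short sequential sketch. It assumes that keys are non-negative integers of $d\nu$ bits, that $d$ is a power of two, and that ranks are computed by a helper `ranks` (hypothetical name) rather than by sorting modules; concatenation is again modeled by tuples.

```python
from bisect import bisect_left


def ranks(seq):
    """rank(x) = number of elements strictly smaller than x."""
    ordered = sorted(seq)
    return [bisect_left(ordered, x) for x in seq]


def sort_long_keys(X, d, v):
    """Sort keys of d*v bits by ranking v-bit blocks (d a power of two)."""
    n = len(X)
    mask = (1 << v) - 1
    # 1. DIVIDE: blocks[h][i] is block h of key i (h = 0 least significant).
    blocks = [[(x >> (h * v)) & mask for x in X] for h in range(d)]
    # 2. SUBPROBLEMS: rank every block within its own multiset.
    level = [ranks(b) for b in blocks]
    # 3. MARRY: balanced tree; each operation ranks the concatenation
    # (more significant part first) of two rank sequences.
    while len(level) > 1:
        level = [ranks(list(zip(level[h + 1], level[h])))
                 for h in range(0, len(level), 2)]
    key_rank = level[0]
    # 4. ROUTE and 5. OUTPUT: sort each block by the rank of its key and
    # reassemble the output keys block by block.
    Y = [0] * n
    for h in range(d):
        for i, (_, blk) in enumerate(sorted(zip(key_rank, blocks[h]))):
            Y[i] |= blk << (h * v)
    return Y
```

Keys with equal rank are equal, so the order in which ties are broken during routing does not affect the output.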

The algorithm we have just described has a shortcoming. In fact all the input keys must be read
(step 1) in order to compute $\mathrm{rank}(X_0), \ldots, \mathrm{rank}(X_{n-1})$ (steps 2 and 3), and
no data can be output until step 5. This shortcoming can be eliminated by modifying the algorithm
according to the observation that to arrange in the correct order the bits of a given block it is
sufficient to know the ranks of the blocks of greater significance.

We can proceed as follows. For simplicity, let $d = d_1 d_2$. Let us also denote by $X(h)$ the portion
of the array $X$ corresponding to the $h$-th block of the keys. (The rows of $X(h)$ are
$X_0(h), \ldots, X_{n-1}(h)$.) Then, we organize $X(d-1), X(d-2), \ldots, X(0)$ in a $d_1 \times d_2$
array with the index of the block in row-major order (see Figure 7.7).

Figure 7.7. Organizing of the input for the pipelined sorting algorithm.


We propose a pipelined sorting algorithm with $d_1$ wavefronts of data, each of which is a row of
the array just defined, and consists of $d_2$ blocks.

The topmost wavefront is processed exactly as described in the nonpipelined version of the
algorithm. However, the ranks computed at step 3 are stored for later use. In fact for the second
wavefront, once the ranks are computed they have to be further concatenated with the ranks of the
first wavefront, and the ranks of the concatenation will drive the permutation of the data in the
second wavefront. The computation proceeds in a similar fashion for the remaining wavefronts.
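A sequential sketch of the wavefront scheme is given below. It assumes integer keys of $d_1 d_2 \nu$ bits and, for brevity, replaces the balanced tree of step 3 by a single ranking of the tuple of block ranks; the names `pipelined_sort` and `ranks` are hypothetical.

```python
from bisect import bisect_left


def ranks(seq):
    """rank(x) = number of elements strictly smaller than x."""
    ordered = sorted(seq)
    return [bisect_left(ordered, x) for x in seq]


def pipelined_sort(X, d1, d2, v):
    """Process d1 wavefronts of d2 blocks each, most significant row first."""
    n = len(X)
    d = d1 * d2
    mask = (1 << v) - 1
    prefix_rank = [0] * n  # nothing is known before the first wavefront
    Y = [0] * n
    for w in range(d1):
        # Block indices of row w, in decreasing order of significance.
        hs = [d - 1 - (w * d2 + c) for c in range(d2)]
        row = [[(x >> (h * v)) & mask for x in X] for h in hs]
        # Rank of the concatenation of the blocks of this wavefront.
        row_rank = ranks(list(zip(*[ranks(b) for b in row])))
        # Concatenate with the stored ranks of the previous wavefronts.
        prefix_rank = ranks(list(zip(prefix_rank, row_rank)))
        # The prefix ranks drive the permutation of this wavefront's data,
        # so its blocks of the output are available immediately.
        for h, blk in zip(hs, row):
            for i, (_, b) in enumerate(sorted(zip(prefix_rank, blk))):
                Y[i] |= b << (h * v)
    return Y
```

Keys that tie on the prefix rank agree on every block processed so far, which is why the blocks of a wavefront can be output before the less significant wavefronts have been read.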


7.4.2  The  Network 

We  now  describe  a  network  capable  of  executing  the  pipelined  version  of  the  sorting  algorithm 
described  above,  with  efficient  area-time  performance. 


Figure  7.8  shows  a  high  level  representation  of  the  network  consisting  of  two  tree  structures  and 
a  family  of  linear  arrays,  whose  interconnection  and  nodes  are  to  be  described  in  the  following. 

(a) The ranking tree. This component is a fully balanced binary tree on $d_2$ leaves, and $d_2 - 1$
internal nodes. Both the leaves and the internal nodes are essentially sorting modules with some
further capabilities to compute the ranks of a sequence, although a leaf module performs a function

Figure 7.8. Structure of the network for sorting a set of long keys.


CHAPTER  8


CONCLUSIONS


In this thesis we have studied the minimum area $A = A_{n,k}(T)$ required to lay out a circuit that
sorts $n$ $k$-bit numbers in time $T$. For the entire range of $T$, $n$ and $k$ we have derived lower
bounds and proposed optimal or near-optimal designs.

In spite of the extensive investigations devoted to sorting in the past two decades, many new
facets of this problem have been revealed by its analysis in the VLSI model of computation. As we
have seen, the nature of sorting varies considerably with the relative size of the multiset being
sorted and the size of the universe from which the keys are drawn. Indeed, at least three intervals of
key lengths can be identified for each of which the area-time complexity is dominated by different
phenomena, so that different algorithms, architectures, and lower-bound arguments are appropriate for
each interval.

Not only has our analysis deepened our understanding of sorting, a problem by itself fundamental
both from a theoretical and from a practical point of view, but it has also led to the development of
tools and methodologies of general interest for the field of VLSI computation theory.

Since the origins of VLSI complexity theory it has been clear that VLSI computations are
dominated by the flow of information in the planar chip. Traditionally, area-time lower bounds have
been based on the information $I$ exchanged across a suitable bisection of the computation graph, and
have the form $AT^2 = \Omega(I^2)$.

We have introduced the square tessellation technique, a powerful tool to establish lower bounds,
based on the information exchanged across the boundary of a suitable set of square cells that
tessellate the layout region. An appropriate choice of the cell size enables us to capture this
information at the level where the bandwidth necessary to sustain it poses the heaviest demand in
terms of area. The square tessellation technique subsumes the bisection technique as a special case,
and, for some problems like cyclic shift, sorting of short words, and sorting of long words (and
probably many others), it provides much stronger bounds.

A very interesting finding has been that the form of the bounds obtained by the tessellation
technique depends upon the computational mechanism forcing the information exchange.

One mechanism, extensively investigated by several authors in the two-processor environment, is
present when some of the output variables of one processor are functions of the input variables of the
other processor. If $U$ is the set of I/O variables we wish to consider, we denote by $I(m)$ the
minimum information exchanged by the two processors when $m$ variables of $U$ are assigned to one
processor and $|U| - m$ are assigned to the other. With this notation the square tessellation bound
can be stated as $AT^2 = \Omega(I^2(m)/m)$. For each problem there is a value of $m$ that maximizes
$I^2(m)/m$, and therefore gives the best lower bound. Thus, it becomes interesting to investigate the
structure of $I$ as a function of $m$, rather than restricting the attention to values of $m$ that are
constant fractions of $|U|$, as it is done in most of the current literature.

Another mechanism, considered here for the first time, and referred to as saturation, occurs when
one of the two processors fills up its memory locations during the computation, and sends some
information to its mate for temporary storage. The analysis of the information exchanged by the cell
of a tessellation because of saturation yields lower bounds on the $AT$ measure. For specific problems
such as cyclic shift, sorting of short words, and sorting of long words, both an $AT^2$ and an $AT$
bound can be obtained. The latter dominates in slow computations while the former dominates in fast
ones, so that the computation exhibits two different regimes.

To see how well the relevant aspects of the computation are captured by our lower-bound
techniques we have turned our attention to upper bounds.

We have begun with a careful study of a special case of sorting, where the key length is
$k = \log n + \Theta(\log n)$, hitherto the center of considerable attention in the VLSI literature.
We have designed optimal sorters in the entire range of meaningful computation times, by developing
the appropriate architectures for the execution of classical algorithms for parallel sorting, such as
bitonic sorting and merge-enumeration sorting.

The optimal sorters for $k = \log n + \Theta(\log n)$ have been the basis for other circuits that can
sort keys of arbitrary length. We have classified the keys as short ($k \le \log n$), medium length
($\log n < k < 2 \log n$), and long ($k \ge 2 \log n$), and for each class we have proposed new
algorithms which can be implemented with optimal or near-optimal area-time performance.

Of particular interest is the examination of the designs that sort short and long keys since the
lower bounds on their performance are obtained by tessellation techniques. Indeed, these designs
exhibit two different computational regimes: one for slow computation where the area of the circuits
is (almost) proportional to $1/T$, and one for fast computation where the area of the circuits is
proportional to $1/T^2$.

The presence of two different regimes can be explicated observing that the problems admit a
decomposition into a number $d$ of subproblems, all of the same type, whose area-time complexity is
determined by bisection flow and is therefore described by a relation of the form
$at^2 = \Omega(i^2)$, for $t \in [t_{\min}, t_{\max}]$.

For short keys, the $d = n/r$ subproblems are transcoding of multisets of size $r$: $i = \Theta(r)$,
$t_{\min} = \Theta(\log r)$, and $t_{\max} = \Theta(\sqrt{r})$. For long keys, the $d = k/\log n$
subproblems are ranking and permuting of $n$ $\log n$-bit substrings of the keys:
$i = \Theta(n \log n)$, $t_{\min} = \Theta(\log n)$ and $t_{\max} = \Theta(\sqrt{n \log n})$.

If the network is equipped with $d$ modules, each capable of processing a subproblem in area $a$ and
time $t$, the overall performance of the network is given by $A = da$, $T = t$, so that
$AT^2 = \Omega(d\,i^2)$ for $T \in [t_{\min}, t_{\max}]$. This is the "fast" regime.

If the network is equipped with $d_2 < d$ modules, then the best strategy is to have $d/d_2$
subproblems processed by each module, and to realize the modules as small as possible, with
$t = t_{\max}$. In this situation $A = d_2 a$, $T = (d/d_2)\,t_{\max}$, and
$AT = \Omega(d\,i^{3/2})$. For $1 \le d_2 \le d$, $T \in [t_{\max}, d\,t_{\max}]$, and we obtain the
"slow" regime.

The above analysis is only approximate because it assumes that the area-time resources to combine
the solutions to the subproblems are at most of the same order of those needed to obtain them. Indeed,
the upper bounds of Chapter 7 are slightly larger than those suggested by the above analysis as well
as by the square-tessellation lower bounds, indicating that further work is needed to complete the
characterization of the area-time complexity of sorting.

Some interesting observations of general interest for VLSI computation can also be made by
focusing on the architecture of the several forms of sorters examined in this thesis. Although we have
considered a variety of networks, all of them can be viewed as developments of three basic structures:
the linear array, the binary tree, and the binary cube. Since this is true also of the networks used
in designs proposed in the literature for the solution of other problems, there is a reasonable hope
for a coherent and systematic theory of VLSI architectures.

The lower-bound, algorithmic, and architectural concepts introduced in this thesis will certainly
be useful to analyze other problems related to sorting, like merging or sorting of records where there
is another field beside the key. We also hope that they will contribute to the development of a
comprehensive theory of VLSI computation.


slightly different from an internal-node module. More specifically, the basic operation of a leaf
consists of receiving the keys in a block, i.e., $X_0(h), \ldots, X_{n-1}(h)$, and of computing their ranks
$\mathrm{rank}(X_0(h)), \ldots, \mathrm{rank}(X_{n-1}(h))$.

The basic operation of an internal node consists in computing the ranks of the elements in the
sequence obtained by pairwise concatenation of the sequences produced by the two offspring
nodes.

A special role is played by the root module, which ranks the concatenation of three sequences,
namely (from left to right): the ranks computed at the previous step, the ranks produced by the
left son, and those produced by the right son.

(b) The Broadcast-and-Sort Tree. This component is also a fully balanced binary tree with $d_2$ leaves.
The root of this tree is connected to the root of the ranking tree, from which it receives the
sequence of the ranks. The internal nodes are simple modules (buffers) that duplicate the input
sequence so that, after $\log d_2$ levels, $d_2$ copies of the ranking sequence corresponding to a given
input wavefront are available, one for each leaf.

The leaves are essentially sorting modules. Their basic operation consists in permuting the
keys of a given block according to the corresponding ranks, as described in the algorithm.

(c) The Buffers. If we consider a given block $X(h)$, we see that there is an interval of time elapsing
between the time when the block is input by a leaf of the ranking tree and the time when the
same block is input by a leaf of the broadcast-and-sort tree, to be rearranged in the final order.
During this interval the keys of the block are temporarily stored in buffers, connected as a linear
array of $2\log d_2 - 1$ modules.

Assuming that all the modules of the network perform their basic operation in the same time
(which can always be achieved without altering the asymptotic performance), the buffers
guarantee that the sequence of keys of a given block reaches a leaf of the sorting tree at the same
time as its corresponding sequence of ranks.


Area-Time Performance

We now assume that all the modules of the network have $O(b) \times O(b)$ area, and that all the
connections are realized with $O(b)$ bandwidth. Thus, the entire network can be easily laid out in area

\[ A = O(d_2 \log d_2 \cdot b^2). \tag{7.68} \]

As to the computation time, it is easy to see that the basic operations of the modules can all be
performed in $O(n \log n / b)$ time, provided that $b \in [\Omega(\sqrt{n \log n}), O(n)]$. Indeed, the operations that are
computationally harder are all of the sorting type, and are performed on sequences of n keys whose length
is a small multiple of $\log n$.

Considering that the global number of basic steps is proportional to $2 \log d_2$ (the depth of the
pipe), plus $d_1$ (the number of wavefronts), the global computation time is

\[ T = O\bigl((d_1 + 2 \log d_2)(n \log n / b)\bigr). \tag{7.69} \]

It remains to select the values for $d_1$, $d_2$, and $b$ to optimize the area-time performance, subject to
the constraints $d_1 d_2 = k/\log n = d$ and $b \in [\Omega(\sqrt{n \log n}), O(n)]$.

We begin by considering some special cases in order to get some intuition. Suppose, for instance,
that we desire to obtain the fastest possible execution. It is clear that we have to maximize $b$ and to
minimize $d_1 + 2 \log d_2$. For $b$, the maximum value in the permissible range is $b = \Theta(n)$. To minimize
$d_1 + 2 \log d_2$ under the constraint $d_1 d_2 = k/\log n$, we should choose $d_1 = 3$. However, decreasing
$d_1$ increases $d_2$ and therefore the area (see 7.68). Moreover, when $d_1$ becomes smaller than $2 \log d_2$, there
is no gain in the order of the computation time. Thus, we choose for $d_1$ the value that satisfies the
equation

\[ d_1 = 2 \log\bigl((k/\log n)/d_1\bigr) \tag{7.70} \]

or, more precisely, the integer part of it. Thus

\[ d_2 = k/(d_1 \log n) = O\bigl((k/\log n)/\log(k/\log n)\bigr). \tag{7.71} \]

In conclusion we obtain

\[ A = O(k n^2/\log n) = O(d n^2) \tag{7.72} \]

and

\[ T = O\bigl(\log n \log(k/\log n)\bigr) = O(\log n \log d). \tag{7.73} \]

If, instead, we want to minimize the area, we need to minimize both $b$ and $d_2$. This is achieved
when $b = O(\sqrt{n \log n})$ and $d_2 = 1$, so that each tree degenerates to a single node and the buffers
disappear. The performance of this design is given by

\[ A = O(n \log n) \tag{7.74} \]

and (since $d_1$ is $d/d_2 = d$)

\[ T = O(d \sqrt{n \log n}). \tag{7.75} \]

This slow design is indeed optimal, as it achieves the corresponding lower bound.

Between the fastest and the slowest design there is an interesting spectrum of intermediate
behaviors, as stated by the following theorem.

Theorem 7.2. An $(n,k)$-sorter can be constructed, for $k \ge 2 \log n$, with the following performance:

\[ AT^2 = O\bigl(d \log^2 d \,(n \log n)^2\bigr) \quad \text{for } T \in [\Omega(\log d \log n),\ O(\log d \sqrt{n \log n})] \tag{7.76} \]

and

\[ AT = O\bigl(d \log(d \sqrt{n \log n}/T)\,(n \log n)^{3/2}\bigr) \quad \text{for } T \in [\Omega(\log d \sqrt{n \log n}),\ O(d \sqrt{n \log n})]. \tag{7.77} \]

Proof. The proof is essentially based on simple manipulations of Equations 7.68 and 7.69. More
specifically, the $AT^2$ behavior is obtained by choosing for $d_1$ the value such that
$d_1 = 2 \log((k/\log n)/d_1)$, according to Equation 7.70, and letting $b$ range in the
$[\Omega(\sqrt{n \log n}), O(n)]$ interval, while $d_2$ is held fixed. The $AT$ behavior is obtained by choosing
$b = \Theta(\sqrt{n \log n})$, and letting $d_1$ vary in the corresponding interval. □


Comparing the results of Theorem 7.2 with the lower bounds, we can make the following
observations. For $T \in [\Omega(\log d \log n), O(\log d \sqrt{n \log n})]$ there is an
$O(\log^2 d) \cdot O(\ldots/\log n)$ gap between upper and lower bounds. For
$T \in [\Omega(\log d \sqrt{n \log n}), O(d \sqrt{n \log n})]$ there is an $O(\log(d \sqrt{n \log n}/T))$ gap, which
diminishes with $T$ and becomes $O(1)$ for the slowest design, as we have already noticed above. This design
simultaneously attains the $A = \Omega(n \log n)$ lower bound on the layout area.

Although the small size of the gap indicates that first steps in the right direction have been made in
the analysis of the problem of sorting long keys, further work is needed to obtain a complete
characterization of the area-time complexity. Besides the gap itself, another problem is that our fastest design
achieves $T = O(\log n \log d)$, while the lower bound on computation time is $T = \Omega(\log n + \log d)$.

Two observations on the results of this section are worth mentioning.

(1) The sorter we have described has all I/O ports on the boundary, a condition that has not been
exploited in deriving the lower bounds.

(2) For $n = 2$ the sorter becomes a comparator-exchanger. Indeed, all the modules of the network
degenerate to simple gates and the resulting circuit is very simple. Thus, considering that when
$n = 2$, then $\log n = 1$ and $d = k$, Theorem 7.2 yields the following interesting corollary.

Theorem 7.4. A comparator-exchanger can be constructed with the following performance:

\[ AT = O\bigl(k \log(k/T)\bigr), \quad T \in [\Omega(\log k),\ O(k)]. \tag{7.78} \]

Proof. Relation 7.78 is obtained from 7.77 after substitutions $d = k$ and $n = 2$. □

A comparison with the corresponding lower-bound theorem shows that the comparator-exchanger of Theorem
7.4 is area-time optimal in the entire range of possible computation times.
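As a rough numerical illustration of the trade-off among $d_1$, $d_2$, and $b$ in this section, the following Python sketch evaluates the area and time expressions $A = O(d_2 \log d_2 \cdot b^2)$ and $T = O((d_1 + 2 \log d_2)\, n \log n / b)$, with $d_1 d_2 = d = k/\log n$. It is only a sketch under stated assumptions: every constant hidden by the O-notation is set to 1, logarithms are taken in base 2, the factor $\log d_2$ in the area is clamped to at least 1 so that the degenerate case $d_2 = 1$ gives area $b^2$, and the function names and the sample values of n and k are hypothetical.

```python
import math
from typing import Tuple


def fast_design_d1(d: float) -> int:
    """Integer part of the solution of d1 = 2*log2(d/d1), never below 1."""
    if d < 1:
        raise ValueError("d = k / log2(n) must be at least 1")
    d1 = 1
    # x - 2*log2(d/x) increases with x, so step upward while the next
    # integer still satisfies x <= 2*log2(d/x).
    while d1 + 1 <= 2 * math.log2(d / (d1 + 1)):
        d1 += 1
    return d1


def area_time(n: int, k: int, d1: float, b: float) -> Tuple[float, float]:
    """Area and time estimates with all O-constants set to 1 (assumption)."""
    log_n = math.log2(n)
    d = k / log_n
    if not 1 <= d1 <= d:
        raise ValueError("d1 must lie between 1 and d = k / log2(n)")
    d2 = d / d1
    log_d2 = math.log2(d2)
    # Clamp the log factor so that d2 = 1 (single node, no buffers) gives b*b.
    area = d2 * max(1.0, log_d2) * b * b
    time = (d1 + 2 * log_d2) * n * log_n / b
    return area, time


if __name__ == "__main__":
    n, k = 1024, 640                      # hypothetical sizes: log2(n) = 10, d = 64
    d = k / math.log2(n)
    fastest = area_time(n, k, fast_design_d1(d), b=n)
    slowest = area_time(n, k, d1=d, b=math.sqrt(n * math.log2(n)))
    print("fastest (area, time):", fastest)
    print("slowest (area, time):", slowest)
```

The two sample calls correspond to the extreme designs discussed in the text: maximum bandwidth $b = n$ with $d_1$ chosen from the fixed-point equation, and minimum bandwidth $b = \sqrt{n \log n}$ with $d_2 = 1$. Intermediate values of $b$ and $d_1$ can be passed to `area_time` to explore the spectrum between them.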




Gianfranco Bilardi was born in Reggio Calabria, Italy, on March 8, 1956. He received the Laurea
in Ingegneria Elettronica degree from Università di Padova in 1978 and the M.S. degree in Electrical
Engineering from the University of Illinois at Urbana-Champaign in 1982, where he was a Research
Assistant in the Coordinated Science Laboratory until 1984. He was awarded an International Rotary
Fellowship in 1980, and an IBM Graduate Fellowship in 1982 and 1983.




